Update README.md
README.md
CHANGED
@@ -93,11 +93,12 @@ We conduct SFT with a relatively balanced mix of SFT data from different categor

### Peer Comparison

-One of the most reliable ways to compare chatbot models is peer comparison.
-
-
-
+One of the most reliable ways to compare chatbot models is peer comparison.
+With the help of native speakers, we built an instruction test set that focuses on various aspects expected of a user-facing chatbot, namely:
+(1) NLP tasks (e.g., translation & comprehension), (2) Reasoning, (3) Instruction-following, and
+(4) Natural and informal questions. The test set also covers all the languages that we are concerned with.

+We use GPT-4 as an evaluator to rate our models against ChatGPT-3.5 and other baselines.

<img src="seallm_vs_chatgpt_by_lang.png" width="800" />

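The hunk above describes the peer-comparison protocol: GPT-4 acts as a judge over responses from SeaLLM, ChatGPT-3.5, and other baselines on the instruction test set. The sketch below is a minimal illustration of that kind of pairwise GPT-4-as-judge setup, assuming the `openai` Python client; the prompt wording, scoring rubric, and tie handling are assumptions for illustration, not the exact setup used for SeaLLM.

```python
# Minimal sketch of a pairwise GPT-4-as-judge comparison (illustrative only).
# Assumes the `openai` Python package (>= 1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical judging prompt; the actual SeaLLM evaluation prompt is not shown here.
JUDGE_PROMPT = """You are a strict evaluator. Given a user instruction and two
responses (A and B), decide which response is better, or declare a tie.
Answer with exactly one of: "A", "B", or "tie".

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}
"""

def judge_pair(instruction: str, response_a: str, response_b: str) -> str:
    """Ask GPT-4 to pick the better of two responses to the same instruction."""
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                instruction=instruction,
                response_a=response_a,
                response_b=response_b,
            ),
        }],
    )
    # In practice the A/B order would also be swapped and re-judged to reduce position bias.
    return completion.choices[0].message.content.strip()

# Example: compare a SeaLLM answer against a ChatGPT-3.5 answer for one test item.
verdict = judge_pair(
    instruction="Translate 'Good morning' into Vietnamese.",
    response_a="Chào buổi sáng.",
    response_b="Buổi sáng tốt lành.",
)
print(verdict)  # "A", "B", or "tie"
```

Aggregating such verdicts over the test set, per language, yields the kind of by-language comparison summarized in `seallm_vs_chatgpt_by_lang.png`.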
@@ -127,19 +128,6 @@ As shown in the table, our SeaLLM model outperforms most 13B baselines and reach
| SeaLLM-13bChat/SFT/v2 | 62.35 | 45.81 | 49.92 | 40.04 | 36.49


-<!-- ! Considering removing zero-shot from the main article -->
-<!-- | Random | 25.00 | 25.00 | 25.00 | 23.00 | 23.00 -->
-<!-- | M3-exam / 0-shot | En | Zh | Vi | Id | Th
-|-----------| ------- | ------- | ------- | ------- | ------- |
-| ChatGPT | 75.98 | 61.00 | 57.18 | 48.58 | 34.09
-| Llama-2-13b | 19.49 | 39.07 | 35.38 | 23.66 | 12.44
-| Llama-2-13b-chat | 52.57 | 39.52 | 36.56 | 27.39 | 10.40
-| Polylm-13b-chat | 28.74 | 27.71 | 25.77 | 22.01 | 13.65
-| Qwen-PolyLM-7b-chat | 52.51 | 56.14 | 32.34 | 25.49 | 24.64
-| SeaLLM-13b/78k-step | 36.68 | 36.58 | 41.98 | 25.87 | 20.11
-| SeaLLM-13bChat/SFT/v1 | 64.30 | 45.58 | 48.13 | 37.76 | 30.77
-| SeaLLM-13bChat/SFT/v2 | 62.23 | 41.00 | 47.23 | 35.10 | 30.77 -->
-

### MMLU - Preserving English-based knowledge
