nxphi47 committed on
Commit
a5246b0
·
1 Parent(s): 0e1abcd

Update README.md

Files changed (1)
  1. README.md +5 -17
README.md CHANGED

@@ -93,11 +93,12 @@ We conduct SFT with a relatively balanced mix of SFT data from different categor
 
 ### Peer Comparison
 
-One of the most reliable ways to compare chatbot models is peer comparison. With the help of native speakers, we built an instruction test set that focus on various aspects expected in a user-facing chatbot, namely (1) NLP tasks (e.g. translation & comprehension), (2) Reasoning, (3) Instruction-following and (4) Natural and Informal questions. The test set also covers all languages that we are concerned with.
-
-**Pending peer comparison**
-
+One of the most reliable ways to compare chatbot models is peer comparison.
+With the help of native speakers, we built an instruction test set that focuses on various aspects expected in a user-facing chatbot, namely
+(1) NLP tasks (e.g. translation & comprehension), (2) Reasoning, (3) Instruction-following and
+(4) Natural and Informal questions. The test set also covers all languages that we are concerned with.
 
+We use GPT-4 as an evaluator to rate the comparison between our models and ChatGPT-3.5 as well as other baselines.
 
 <img src="seallm_vs_chatgpt_by_lang.png" width="800" />

@@ -127,19 +128,6 @@ As shown in the table, our SeaLLM model outperforms most 13B baselines and reach
 | SeaLLM-13bChat/SFT/v2 | 62.35 | 45.81 | 49.92 | 40.04 | 36.49
 
 
-<!-- ! Considering removing zero-shot from the main article -->
-<!-- | Random | 25.00 | 25.00 | 25.00 | 23.00 | 23.00 -->
-<!-- | M3-exam / 0-shot | En | Zh | Vi | Id | Th
-|-----------| ------- | ------- | ------- | ------- | ------- |
-| ChatGPT | 75.98 | 61.00 | 57.18 | 48.58 | 34.09
-| Llama-2-13b | 19.49 | 39.07 | 35.38 | 23.66 | 12.44
-| Llama-2-13b-chat | 52.57 | 39.52 | 36.56 | 27.39 | 10.40
-| Polylm-13b-chat | 28.74 | 27.71 | 25.77 | 22.01 | 13.65
-| Qwen-PolyLM-7b-chat | 52.51 | 56.14 | 32.34 | 25.49 | 24.64
-| SeaLLM-13b/78k-step | 36.68 | 36.58 | 41.98 | 25.87 | 20.11
-| SeaLLM-13bChat/SFT/v1 | 64.30 | 45.58 | 48.13 | 37.76 | 30.77
-| SeaLLM-13bChat/SFT/v2 | 62.23 | 41.00 | 47.23 | 35.10 | 30.77 -->
-
 
 ### MMLU - Preserving English-based knowledge
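The added line describes a GPT-4-as-judge evaluation: the judge model rates a pairwise comparison between two assistants' answers. Below is a minimal sketch of what such a judging prompt and verdict parser might look like; the prompt wording, the 1-10 rating scale, and the function names are assumptions for illustration, not the repository's actual evaluation code (the judge call itself, e.g. to the OpenAI API, is left out).

```python
# Hedged sketch of GPT-4-as-judge pairwise comparison.
# Prompt format, scale, and names are illustrative assumptions,
# not the SeaLLM repo's actual evaluation harness.

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Build a pairwise-comparison prompt asking the judge to score both answers."""
    return (
        "You are an impartial judge. Rate each assistant's answer to the user "
        "question on a 1-10 scale, then output two lines of the form "
        "'A: <score>' and 'B: <score>'.\n\n"
        f"[Question]\n{question}\n\n"
        f"[Assistant A]\n{answer_a}\n\n"
        f"[Assistant B]\n{answer_b}\n"
    )

def parse_verdict(judge_output: str) -> str:
    """Return 'A', 'B', or 'tie' from the judge's scored reply."""
    scores = {}
    for line in judge_output.splitlines():
        line = line.strip()
        for side in ("A", "B"):
            if line.startswith(f"{side}:"):
                scores[side] = float(line.split(":", 1)[1])
    a, b = scores.get("A", 0.0), scores.get("B", 0.0)
    if a > b:
        return "A"
    if b > a:
        return "B"
    return "tie"
```

In practice, LLM-as-judge setups of this kind typically also swap the order of the two answers and average the verdicts, since judge models show a position bias toward the first answer.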