nxphi47 committed
Commit 7cf5394 · 1 Parent(s): 5fb2655

Update README.md

Files changed (1)
  1. README.md +35 -28
README.md CHANGED
@@ -46,7 +46,7 @@ The following sections summarize the technical specifications and performance ev
46
  ## Pre-training
47
 
48
  ### Vocabulary Expansion
49
- Like many English/Latin-dominant LLMs, Llama-2's BPE tokenizer breaks non-european and non-latin linguistic texts into unsustainably long byte-level sequences that cover much shorter semantic meanings, leading to degraded performance [(Nguyen et al., 2023)](https://arxiv.org/abs/2306.11372). For instance, it takes 4.3x more tokens to encode the same sentence in Thai compared to that in English. This leads to the models failing to perform summarization and comprehension tasks without exceeding the context length.
50
 
51
  Our goal for vocabulary expansion is threefold: (1) the number of newly added tokens must be minimal and cover only the new languages, (2) the new tokens should bring the compression ratios of the new languages close to that of English, and (3) the disruption of existing European-language tokens should be minimized to preserve Llama-2's knowledge. In the end, we obtain **~11K** new tokens for Vi, Id, Th and Zh to augment the original 32000-token vocabulary. Details of our expansion technique will be revealed in our upcoming technical report.
52
 
@@ -63,6 +63,7 @@ As seen in the below table, our new vocabulary reduce the compression ratio from
63
 
64
  ### Pre-training Data
65
 
 
66
 
67
  ### Pre-training Strategies
68
 
@@ -82,15 +83,16 @@ More importantly, we engaged native speakers to collect a small amount of natura
82
 
83
  ### SFT Strategies
84
 
85
- We
86
 
87
 
88
  ## Evaluation
89
 
90
  ### Peer Comparison
91
 
92
- Evaluated by
93
 
 
94
  <!-- ! Add the stack chart better -->
95
  | vs ChatGPT | win | lose | tie
96
  | --- | --- | --- | --- |
@@ -100,18 +102,16 @@ Evaluated by
100
 
101
  ### M3Exam - World Knowledge in Regional Languages
102
 
103
- Introduction about the M3Exam
104
 
105
- <!-- | Qwen-7b-chat | 33.91 | 60.85 | 29.57 | 0.00 | 18.04
106
- | Qwen-13b-v3-pro | 75.30 | 89.27 | 56.68 | 49.46 | 39.35
107
- | Qwen-13b-v3-pro-SFT | 38.20 | 4.23 | 46.39 | 33.97 | 19.79
108
- | Qwen-14b | 75.56 | 88.78 | 54.61 | 49.97 | 42.62
109
- | Qwen-14b-SFT | 49.50 | 41.79 | 54.84 | 44.91 | 19.51 -->
110
 
111
- | M3-exam / 3-shot | En | Zh | Vi | Id | Th
 
 
 
112
  |-----------| ------- | ------- | ------- | ------- | ------- |
113
  | Random | 25.00 | 25.00 | 25.00 | 23.00 | 23.00
114
- | ChatGPT | 75.46 | 60.20 | 58.64 | ? | 37.41
115
  | Llama-2-13b | 59.88 | 43.40 | 41.70 | 34.80 | 23.18
116
  | Llama-2-13b-chat | 61.17 | 43.29 | 39.97 | 35.50 | 23.74
117
  | Polylm-13b-chat | 32.23 | 29.26 | 29.01 | 25.36 | 18.08
@@ -121,7 +121,7 @@ Introduction about the M3Exam
121
  | SeaLLM-13bChat/SFT/v2 | 62.35 | 45.81 | 49.92 | 40.04 | 36.49
122
 
123
 
124
- <!-- ! Considering removing zero-shot -->
125
  <!-- | Random | 25.00 | 25.00 | 25.00 | 23.00 | 23.00 -->
126
  <!-- | M3-exam / 0-shot | En | Zh | Vi | Id | Th
127
  |-----------| ------- | ------- | ------- | ------- | ------- |
@@ -137,21 +137,27 @@ Introduction about the M3Exam
137
 
138
  ### MMLU - Preserving English-based knowledge
139
 
140
- | 13B Models | STEM | Humanities | Social | Others | Average
 
 
141
  |-----------| ------- | ------- | ------- | ------- | ------- |
142
- | Llama-2 | 44.10 | 52.80 | 62.60 | 61.10 | 54.80
143
- | Llama-2-chat | 43.70 | 49.30 | 62.60 | 60.10 | 53.50
144
  | SeaLLM-13bChat/SFT/v2 | 43.67 | 52.09 | 62.69 | 61.20 | 54.70
145
  | SeaLLM-13bChat/SFT/v3 | 43.30 | 52.80 | 63.10 | 61.20 | 55.00
146
 
147
 
148
  ### NLP tasks
149
 
150
- #### Reading Comprehension (Xquad & IndoQA)
 
 
151
 
152
- 1-shot
153
 
154
- Read-Comphrension | En | Zh | Vi | Id | Th | ALL | SEA
 
 
155
  |-----------| ------- | ------- | ------- | ------- | ------- | ------- | ------- |
156
  | Llama-2-13b | 83.22 | 78.02 | 71.03 | 59.31 | 30.73 | 64.46 | 59.77
157
  | Llama-2-13b-chat | 80.46 | 70.54 | 62.87 | 63.05 | 25.73 | 60.93 | 51.21
@@ -161,28 +167,30 @@ Read-Comphrension | En | Zh | Vi | Id | Th | ALL | SEA
161
 
162
  #### Translation
163
 
164
- Translation between SEA-En. Scores in chrF++
 
 
165
 
166
- Model | En-Zh | En-Vi | En-Id | En-Th | En->X | Zh-En | Vi-En | Id-En | Th-En | X->En
167
  |-------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
168
  | Llama-2-13b | 24.36 | 53.20 | 60.41 | 22.16 | 45.26 | 53.20 | 59.10 | 63.42 | 38.48 | 53.55
169
  | Llama-2-13b-chat | 19.58 | 51.70 | 57.14 | 21.18 | 37.40 | 52.27 | 54.32 | 60.55 | 30.18 | 49.33
170
  | SeaLLM-13b-chat-v1 | 22.77 | 58.96 | 64.78 | 42.38 | 55.37 | 53.20 | 60.29 | 65.03 | 57.24 | 60.85
171
  | SeaLLM-13b-chat-v2 | 22.75 | 58.78 | 65.90 | 42.60 | 55.76 | 53.34 | 60.80 | 65.44 | 57.05 | 61.10
172
 
173
- Translation between SEA-SEA
174
 
175
- Model | Vi-Id | Id-Vi | Vi-Th | Th-Vi | Id-Th | Th-Id
176
  |-------- | ---- | ---- | ---- | ---- | ---- | ---- |
177
- ChatGPT | 56.75 | 54.17 | 40.48 | 46.54 | 40.59 | 51.87
178
- SeaLLM-13b-base mixed SFT | 54.56 | 54.76 | 36.68 | 51.88 | 39.36 | 47.99
179
- SeaLLM-13b-Chat/SFT/v2 | 53.75 | 52.47 | 32.76 | 49.20 | 40.43 | 50.03
180
 
181
  #### Summarization
182
 
183
- XL-sum - Rouge-L - 2shot
184
 
185
- XL-Summarization (rouge-L) | En | Zh | Vi | Id | Th
186
  |-------- | ---- | ---- | ---- | ---- | ---- |
187
  | Llama-2-13b | 32.57 | 34.37 | 18.61 | 25.14 | 16.91
188
  | Llama-2-13b-chat | 25.11 | 31.13 | 18.29 | 22.45 | 17.51
@@ -200,4 +208,3 @@ If you find our project useful, hope you can star our repo and cite our work as
200
  year = 2023,
201
  }
202
  ```
203
-
 
46
  ## Pre-training
47
 
48
  ### Vocabulary Expansion
49
+ Like many English/Latin-dominant LLMs, Llama-2's BPE tokenizer breaks non-European and non-Latin text into unreasonably long byte-level sequences, each covering only a short span of meaning, which leads to [degraded performance](https://arxiv.org/abs/2306.11372). For instance, encoding the same sentence in Thai takes 4.3x more tokens than in English. As a result, the models fail to perform summarization and comprehension tasks without exceeding the context length.
50
 
51
  Our goal for vocabulary expansion is threefold: (1) the number of newly added tokens must be minimal and cover only the new languages, (2) the new tokens should bring the compression ratios of the new languages close to that of English, and (3) the disruption of existing European-language tokens should be minimized to preserve Llama-2's knowledge. In the end, we obtain **~11K** new tokens for Vi, Id, Th and Zh to augment the original 32000-token vocabulary. Details of our expansion technique will be revealed in our upcoming technical report.
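
As a rough illustration of how such compression ratios can be measured (not the exact pipeline used for this report), one can count tokens over parallel sentences; the model name and sentences below are placeholders:

```python
# Minimal sketch: compare how many tokens a tokenizer spends on parallel
# sentences in different languages. Model name and sentences are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

parallel = {
    "en": "The weather is very nice today, so we went for a walk.",
    "vi": "Hôm nay thời tiết rất đẹp nên chúng tôi đã đi dạo.",
    "th": "วันนี้อากาศดีมาก เราจึงออกไปเดินเล่น",
}

en_tokens = len(tokenizer.encode(parallel["en"], add_special_tokens=False))
for lang, sentence in parallel.items():
    n = len(tokenizer.encode(sentence, add_special_tokens=False))
    print(f"{lang}: {n} tokens ({n / en_tokens:.2f}x English)")
```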
52
 
 
63
 
64
  ### Pre-training Data
65
 
66
+ **Pending Lixin's**
67
 
68
  ### Pre-training Strategies
69
 
 
83
 
84
  ### SFT Strategies
85
 
86
+ We conduct SFT with a relatively balanced mix of SFT data from different categories. We make use of the system prompt during training, as we found it induces a prior that conditions the model toward a behavioral distribution focused on safety and usefulness.
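
As a sketch of how a system prompt can be folded into each SFT example, the snippet below uses the standard Llama-2 chat template; the system prompt text and the exact formatting are illustrative assumptions, not SeaLLM's actual recipe:

```python
# Illustrative only: wrap one SFT example in the Llama-2 chat template with a
# system prompt. The system prompt text is a placeholder, not SeaLLM's.
SYSTEM_PROMPT = "You are a helpful, respectful and safe multilingual assistant."

def format_sft_example(instruction: str, response: str) -> str:
    # Llama-2 chat convention: the <<SYS>> block sits inside the first [INST] turn.
    return (
        f"<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n"
        f"{instruction} [/INST] {response} </s>"
    )

print(format_sft_example("Dịch sang tiếng Anh: Hôm nay trời đẹp.",
                         "The weather is nice today."))
```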
87
 
88
 
89
  ## Evaluation
90
 
91
  ### Peer Comparison
92
 
93
+ One of the most reliable ways to compare chatbot models is peer comparison. With the help of native speakers, we built an instruction test set that focuses on various aspects expected of a user-facing chatbot, namely (1) NLP tasks (e.g. translation & comprehension), (2) reasoning, (3) instruction-following and (4) natural and informal questions. The test set also covers all of the languages we target.
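
For concreteness, the win/lose/tie rates reported below can be obtained by aggregating per-question pairwise verdicts; the verdict labels and data in this sketch are hypothetical:

```python
# Hypothetical sketch: turn per-question pairwise verdicts against ChatGPT
# ("win" = our response preferred, "lose", "tie") into the rates in the table.
from collections import Counter

verdicts = ["win", "tie", "lose", "win", "win", "tie"]  # placeholder judgments

counts = Counter(verdicts)
for outcome in ("win", "lose", "tie"):
    print(f"{outcome}: {100 * counts[outcome] / len(verdicts):.1f}%")
```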
94
 
95
+ **Pending peer comparison**
96
  <!-- ! Add the stack chart better -->
97
  | vs ChatGPT | win | lose | tie
98
  | --- | --- | --- | --- |
 
102
 
103
  ### M3Exam - World Knowledge in Regional Languages
104
 
 
105
 
106
+ [M3Exam](https://arxiv.org/pdf/2306.05179.pdf) is a benchmark built from real, official human exam questions from multiple countries in the SEA region. Answering them requires strong multilingual proficiency and cultural knowledge, and the questions span critical educational stages from primary school to high school.
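
A simplified sketch of what few-shot multiple-choice evaluation on such exam questions can look like (the prompt layout and answer extraction here are assumptions, not the exact protocol behind the numbers below):

```python
# Simplified sketch of k-shot multiple-choice evaluation: build a prompt from
# solved exemplars plus the test question, then compare the model's predicted
# letter with the gold answer.
def build_prompt(exemplars, question, options):
    # exemplars: dicts with "question", "options" (list of (letter, text)), "answer"
    blocks = []
    for ex in exemplars:
        opts = "\n".join(f"{letter}. {text}" for letter, text in ex["options"])
        blocks.append(f"Question: {ex['question']}\n{opts}\nAnswer: {ex['answer']}")
    opts = "\n".join(f"{letter}. {text}" for letter, text in options)
    blocks.append(f"Question: {question}\n{opts}\nAnswer:")
    return "\n\n".join(blocks)

def accuracy(predicted_letters, gold_letters):
    correct = sum(p.strip().upper().startswith(g)
                  for p, g in zip(predicted_letters, gold_letters))
    return correct / len(gold_letters)
```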
 
 
 
 
107
 
108
+ As shown in the table, our SeaLLM model outperforms most 13B baselines and closes much of the gap to ChatGPT. Notably, for Thai, a seemingly low-resource language, our model is only about 1 point behind ChatGPT despite the large difference in model size.
109
+
110
+
111
+ | M3Exam / 3-shot (Acc) | En | Zh | Vi | Id | Th
112
  |-----------| ------- | ------- | ------- | ------- | ------- |
113
  | Random | 25.00 | 25.00 | 25.00 | 23.00 | 23.00
114
+ | ChatGPT | 75.46 | 60.20 | 58.64 | 48+? | 37.41
115
  | Llama-2-13b | 59.88 | 43.40 | 41.70 | 34.80 | 23.18
116
  | Llama-2-13b-chat | 61.17 | 43.29 | 39.97 | 35.50 | 23.74
117
  | Polylm-13b-chat | 32.23 | 29.26 | 29.01 | 25.36 | 18.08
 
121
  | SeaLLM-13bChat/SFT/v2 | 62.35 | 45.81 | 49.92 | 40.04 | 36.49
122
 
123
 
124
+ <!-- ! Considering removing zero-shot from the main article -->
125
  <!-- | Random | 25.00 | 25.00 | 25.00 | 23.00 | 23.00 -->
126
  <!-- | M3-exam / 0-shot | En | Zh | Vi | Id | Th
127
  |-----------| ------- | ------- | ------- | ------- | ------- |
 
137
 
138
  ### MMLU - Preserving English-based knowledge
139
 
140
+ On the 5-shot [MMLU](https://arxiv.org/abs/2009.03300), our SeaLLM models not only preserve the performance of 13B Llama-2 and Llama-2-chat but even slightly outperform them, despite the fact that we never intentionally optimized for this English-based test set.
141
+
142
+ | MMLU (Acc) | STEM | Humanities | Social | Others | Average
143
  |-----------| ------- | ------- | ------- | ------- | ------- |
144
+ | Llama-2-13b | 44.10 | 52.80 | 62.60 | 61.10 | 54.80
145
+ | Llama-2-13b-chat | 43.70 | 49.30 | 62.60 | 60.10 | 53.50
146
  | SeaLLM-13bChat/SFT/v2 | 43.67 | 52.09 | 62.69 | 61.20 | 54.70
147
  | SeaLLM-13bChat/SFT/v3 | 43.30 | 52.80 | 63.10 | 61.20 | 55.00
148
 
149
 
150
  ### NLP tasks
151
 
152
+ We also evaluate our models on a range of standard NLP tasks.
153
+
154
+ #### Reading Comprehension (XQUAD & IndoQA)
155
 
156
+ [XQUAD](https://github.com/google-deepmind/xquad) is a popular multilingual variant of the [SQUAD](https://www.aclweb.org/anthology/D16-1264/) benchmark, which evaluates reading comprehension. As XQUAD does not include Indonesian, we substitute it with [IndoQA](https://huggingface.co/datasets/jakartaresearch/indoqa), which was built for the same purpose.
157
 
158
+ As shown in the table below, our models' 1-shot reading comprehension performance on the SEA languages is significantly better than Llama-2's, while the high performance on the existing languages (En & Zh) is preserved.
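
The F1 reported here is presumably the standard SQuAD-style token-overlap score between predicted and gold answers; a minimal version (whitespace tokenization only, without the extra normalization of the official scripts) looks like:

```python
# Minimal sketch of SQuAD-style answer F1: token overlap between prediction and
# gold answer. Official evaluation scripts additionally normalize punctuation,
# articles and casing, and take the max over multiple gold answers.
from collections import Counter

def answer_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("in the year 1998", "1998"))  # 0.4
```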
159
+
160
+ | XQUAD/IndoQA (F1) | En | Zh | Vi | Id | Th | ALL | SEA-lang
161
  |-----------| ------- | ------- | ------- | ------- | ------- | ------- | ------- |
162
  | Llama-2-13b | 83.22 | 78.02 | 71.03 | 59.31 | 30.73 | 64.46 | 59.77
163
  | Llama-2-13b-chat | 80.46 | 70.54 | 62.87 | 63.05 | 25.73 | 60.93 | 51.21
 
167
 
168
  #### Translation
169
 
170
+ For translation, we evaluate our models on [FloRes-200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md) using [chrF++](https://aclanthology.org/W15-3049/) scores in a 4-shot setting.
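
chrF++ can be computed with the `sacrebleu` library, where `word_order=2` adds the word n-grams that distinguish chrF++ from plain chrF; the hypothesis and reference below are placeholders, not FloRes-200 data:

```python
# Sketch: corpus-level chrF++ with sacrebleu (word_order=2 gives chrF++).
from sacrebleu.metrics import CHRF

hypotheses = ["Hôm nay trời rất đẹp."]          # system outputs (placeholder)
references = [["Hôm nay thời tiết rất đẹp."]]   # one reference stream (placeholder)

chrf_pp = CHRF(word_order=2)
print(chrf_pp.corpus_score(hypotheses, references))
```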
171
+
172
+ Similarly, our SeaLLM models significantly outperform Llama-2 in the new languages.
173
 
174
+ | FloRes-200 (chrF++) | En-Zh | En-Vi | En-Id | En-Th | En->X | Zh-En | Vi-En | Id-En | Th-En | X->En
175
  |-------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
176
  | Llama-2-13b | 24.36 | 53.20 | 60.41 | 22.16 | 45.26 | 53.20 | 59.10 | 63.42 | 38.48 | 53.55
177
  | Llama-2-13b-chat | 19.58 | 51.70 | 57.14 | 21.18 | 37.40 | 52.27 | 54.32 | 60.55 | 30.18 | 49.33
178
  | SeaLLM-13b-chat-v1 | 22.77 | 58.96 | 64.78 | 42.38 | 55.37 | 53.20 | 60.29 | 65.03 | 57.24 | 60.85
179
  | SeaLLM-13b-chat-v2 | 22.75 | 58.78 | 65.90 | 42.60 | 55.76 | 53.34 | 60.80 | 65.44 | 57.05 | 61.10
180
 
181
+ Our models also perform competitively with ChatGPT when translating directly between SEA languages, without pivoting through English.
182
 
183
+ | FloRes-200 (chrF++) | Vi-Id | Id-Vi | Vi-Th | Th-Vi | Id-Th | Th-Id
184
  |-------- | ---- | ---- | ---- | ---- | ---- | ---- |
185
+ | ChatGPT | 56.75 | 54.17 | 40.48 | 46.54 | 40.59 | 51.87
186
+ | SeaLLM-13b-base mixed SFT | 54.56 | 54.76 | 36.68 | 51.88 | 39.36 | 47.99
187
+ | SeaLLM-13b-Chat/SFT/v2 | 53.75 | 52.47 | 32.76 | 49.20 | 40.43 | 50.03
188
 
189
  #### Summarization
190
 
191
+ Lastly, on the 2-shot [XL-Sum summarization task](https://aclanthology.org/2021.findings-acl.413/), our models also achieve better performance, with substantial gains in Thai.
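
ROUGE-L can be computed with, for example, the `rouge_score` package; the strings below are placeholders, and non-English text may require language-specific tokenization on top of this:

```python
# Sketch: ROUGE-L between a reference summary and a model summary using the
# rouge_score package. Strings are placeholders, not XL-Sum data.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
scores = scorer.score(
    target="the storm forced hundreds of residents to evacuate their homes",
    prediction="hundreds of residents were evacuated because of the storm",
)
print(scores["rougeL"].fmeasure)
```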
192
 
193
+ | XL-Sum (rouge-L) | En | Zh | Vi | Id | Th
194
  |-------- | ---- | ---- | ---- | ---- | ---- |
195
  | Llama-2-13b | 32.57 | 34.37 | 18.61 | 25.14 | 16.91
196
  | Llama-2-13b-chat | 25.11 | 31.13 | 18.29 | 22.45 | 17.51
 
208
  year = 2023,
209
  }
210
  ```