Update README.md

## Pre-training

### Vocabulary Expansion

Like many English/Latin-dominant LLMs, Llama-2's BPE tokenizer breaks non-European and non-Latin text into unsustainably long byte-level sequences that carry much less semantic content per token, leading to [degraded performance](https://arxiv.org/abs/2306.11372). For instance, it takes 4.3x more tokens to encode the same sentence in Thai than in English. As a result, the model often cannot complete summarization or comprehension tasks without exceeding its context length.
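
To illustrate the compression gap, here is a minimal sketch that counts tokens for a parallel English/Thai sentence pair with the off-the-shelf Llama-2 tokenizer. The model id and sentences are placeholders, not the exact setup behind the 4.3x figure above.

```python
# Illustrative only: compare how many tokens the Llama-2 tokenizer needs for
# parallel sentences in English vs. Thai. Requires access to the gated
# Llama-2 checkpoint; the sentences are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

parallel = {
    "en": "The weather is very nice today.",
    "th": "วันนี้อากาศดีมาก",
}

counts = {lang: len(tokenizer(text)["input_ids"]) for lang, text in parallel.items()}
print(counts)
print(f"Th/En token ratio: {counts['th'] / counts['en']:.2f}")
```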

Our goal for vocabulary expansion is threefold: (1) the number of newly-added tokens must be minimal and cover only the new languages, (2) the new tokens should bring the compression ratios of the new languages close to that of English, and (3) the disruption to existing European tokens must be minimal so that Llama-2's knowledge is preserved. In the end, we obtain **~11K** new tokens for Vi, Id, Th and Zh to augment the original 32000-token vocabulary. Details of our expansion technique will be revealed in our upcoming technical report.

### Pre-training Data

**Pending Lixin's**

### Pre-training Strategies

### SFT Strategies

We conduct SFT with a relatively balanced mix of SFT data from different categories. We make use of the system prompt during training, as we found that it helps induce a prior that conditions the model toward a behavioral distribution focused on safety and usefulness.
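
Purely as an illustration of the system-prompt idea (the actual SeaLLM system prompt and chat template are not shown here), a single-turn SFT example wrapped in the standard Llama-2 chat convention might look like this:

```python
# Hypothetical formatting of one SFT example with a system prompt, following
# the Llama-2 chat convention. The system prompt text is a placeholder.
SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."  # placeholder

def format_sft_example(user_msg: str, assistant_msg: str) -> str:
    """Wrap a single-turn (user, assistant) pair with a system prompt."""
    return (
        f"<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n"
        f"{user_msg} [/INST] {assistant_msg} </s>"
    )

print(format_sft_example("Xin chào!", "Chào bạn! Tôi có thể giúp gì cho bạn?"))
```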

## Evaluation

### Peer Comparison

One of the most reliable ways to compare chatbot models is peer comparison. With the help of native speakers, we built an instruction test set that focuses on various aspects expected of a user-facing chatbot, namely (1) NLP tasks (e.g. translation & comprehension), (2) Reasoning, (3) Instruction-following and (4) Natural and Informal questions. The test set also covers all the languages that we are concerned with.

**Pending peer comparison**
<!-- ! Add the stack chart better -->
| vs ChatGPT | win | lose | tie
| --- | --- | --- | --- |

### M3Exam - World Knowledge in Regional Languages

[M3Exam](https://arxiv.org/pdf/2306.05179.pdf) is a benchmark collection of real-life, native and official human exam questions. It covers questions from multiple countries in the SEA region, which require strong multilingual proficiency and cultural knowledge across critical educational stages, from primary- to high-school levels of difficulty.

As shown in the table below, our SeaLLM model outperforms most 13B baselines and comes closer to ChatGPT's performance. Notably, for Thai, a seemingly low-resource language, our model is just 1% behind ChatGPT despite the large difference in model size.
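
For reference, here is a minimal sketch of how k-shot multiple-choice accuracy can be computed; the actual prompts and answer-extraction rules behind the numbers below are not specified here.

```python
# Hypothetical scoring loop for k-shot multiple-choice exams such as M3Exam.
# The prompt format and data layout are illustrative assumptions only.
from typing import Callable

def accuracy(examples: list[dict], shots: list[dict],
             generate: Callable[[str], str]) -> float:
    """examples/shots: dicts with 'question', 'options' (label -> text), 'answer' (e.g. 'B')."""
    def build_prompt(ex: dict) -> str:
        opts = "\n".join(f"{label}. {text}" for label, text in ex["options"].items())
        return f"Question: {ex['question']}\n{opts}\nAnswer:"

    demo = "\n\n".join(f"{build_prompt(s)} {s['answer']}" for s in shots)
    correct = 0
    for ex in examples:
        prediction = generate(f"{demo}\n\n{build_prompt(ex)}").strip()
        correct += prediction[:1].upper() == ex["answer"]
    return correct / len(examples)
```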

| M3Exam / 3-shot (Acc) | En | Zh | Vi | Id | Th
|-----------| ------- | ------- | ------- | ------- | ------- |
| Random | 25.00 | 25.00 | 25.00 | 23.00 | 23.00
| ChatGPT | 75.46 | 60.20 | 58.64 | 48+? | 37.41
| Llama-2-13b | 59.88 | 43.40 | 41.70 | 34.80 | 23.18
| Llama-2-13b-chat | 61.17 | 43.29 | 39.97 | 35.50 | 23.74
| Polylm-13b-chat | 32.23 | 29.26 | 29.01 | 25.36 | 18.08
| SeaLLM-13bChat/SFT/v2 | 62.35 | 45.81 | 49.92 | 40.04 | 36.49

<!-- ! Considering removing zero-shot from the main article -->

### MMLU - Preserving English-based knowledge

On the 5-shot [MMLU](https://arxiv.org/abs/2009.03300), our SeaLLM models not only preserve but also slightly outperform 13B Llama-2 and Llama-2-chat, even though we never intended to optimize for this English- and Chinese-dominant test set.

| MMLU (Acc) | STEM | Humanities | Social | Others | Average
|-----------| ------- | ------- | ------- | ------- | ------- |
| Llama-2-13b | 44.10 | 52.80 | 62.60 | 61.10 | 54.80
| Llama-2-13b-chat | 43.70 | 49.30 | 62.60 | 60.10 | 53.50
| SeaLLM-13bChat/SFT/v2 | 43.67 | 52.09 | 62.69 | 61.20 | 54.70
| SeaLLM-13bChat/SFT/v3 | 43.30 | 52.80 | 63.10 | 61.20 | 55.00

### NLP tasks

We also evaluate our models on a variety of standard NLP tasks.

#### Reading Comprehension (XQUAD & IndoQA)

[XQUAD](https://github.com/google-deepmind/xquad) is a popular multilingual variant of the [SQUAD](https://www.aclweb.org/anthology/D16-1264/) benchmark, which evaluates models on reading comprehension ability. As XQUAD does not support Indonesian, we substitute it with [IndoQA](https://huggingface.co/datasets/jakartaresearch/indoqa), which was built for the same purpose.

As shown in the table below, our models' 1-shot reading comprehension performance is significantly better than Llama-2's for the SEA languages, while the high performance in the existing languages (En & Zh) is preserved.
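
For clarity on the metric, here is a simplified sketch of the token-overlap F1 used in SQuAD-style evaluation; the exact normalization and tokenization behind the reported scores are not specified here.

```python
# Token-overlap F1 as commonly defined for SQuAD-style QA. Normalization is
# simplified (lowercase + whitespace split) for illustration.
from collections import Counter

def qa_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(qa_f1("in the garden", "the garden"))  # 0.8
```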

| XQUAD/IndoQA (F1) | En | Zh | Vi | Id | Th | ALL | SEA-lang
|-----------| ------- | ------- | ------- | ------- | ------- | ------- | ------- |
| Llama-2-13b | 83.22 | 78.02 | 71.03 | 59.31 | 30.73 | 64.46 | 59.77
| Llama-2-13b-chat | 80.46 | 70.54 | 62.87 | 63.05 | 25.73 | 60.93 | 51.21

#### Translation

For translation tasks, we evaluate our models on [FloRes-200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md) using [chrF++](https://aclanthology.org/W15-3049/) scores in a 4-shot setting.
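
chrF++ can be computed with the `sacrebleu` package, for example as below; the sentences are placeholders, not the actual FloRes-200 pipeline.

```python
# Minimal chrF++ computation with sacrebleu (word_order=2 gives chrF++).
# Hypotheses and references below are placeholders.
from sacrebleu.metrics import CHRF

chrf = CHRF(word_order=2)

hypotheses = ["Chào buổi sáng", "Tôi thích đọc sách"]      # system outputs
references = [["Chào buổi sáng", "Tôi thích đọc sách"]]    # one reference stream

print(chrf.corpus_score(hypotheses, references))
```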

As with the other tasks, our SeaLLM models outperform Llama-2 significantly in the new languages.

| FloRes-200 (chrF++) | En-Zh | En-Vi | En-Id | En-Th | En->X | Zh-En | Vi-En | Id-En | Th-En | X->En
|-------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Llama-2-13b | 24.36 | 53.20 | 60.41 | 22.16 | 45.26 | 53.20 | 59.10 | 63.42 | 38.48 | 53.55
| Llama-2-13b-chat | 19.58 | 51.70 | 57.14 | 21.18 | 37.40 | 52.27 | 54.32 | 60.55 | 30.18 | 49.33
| SeaLLM-13b-chat-v1 | 22.77 | 58.96 | 64.78 | 42.38 | 55.37 | 53.20 | 60.29 | 65.03 | 57.24 | 60.85
| SeaLLM-13b-chat-v2 | 22.75 | 58.78 | 65.90 | 42.60 | 55.76 | 53.34 | 60.80 | 65.44 | 57.05 | 61.10

Our models also perform competitively with ChatGPT on translation between SEA languages without pivoting through English.

| FloRes-200 (chrF++) | Vi-Id | Id-Vi | Vi-Th | Th-Vi | Id-Th | Th-Id
|-------- | ---- | ---- | ---- | ---- | ---- | ---- |
| ChatGPT | 56.75 | 54.17 | 40.48 | 46.54 | 40.59 | 51.87
| SeaLLM-13b-base mixed SFT | 54.56 | 54.76 | 36.68 | 51.88 | 39.36 | 47.99
| SeaLLM-13b-Chat/SFT/v2 | 53.75 | 52.47 | 32.76 | 49.20 | 40.43 | 50.03

#### Summarization

Lastly, on the 2-shot [XL-Sum summarization task](https://aclanthology.org/2021.findings-acl.413/), our models also achieve better performance, with substantial gains in Thai.
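
ROUGE-L can be computed with the `rouge-score` package, for example as below; the texts are placeholders, not the actual XL-Sum evaluation setup.

```python
# Minimal ROUGE-L computation with the rouge-score package. The reference and
# candidate texts are placeholders for illustration only.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "The government announced a new policy on renewable energy."
candidate = "A new renewable energy policy was announced by the government."
print(scorer.score(reference, candidate)["rougeL"].fmeasure)
```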

| XL-Sum (rouge-L) | En | Zh | Vi | Id | Th
|-------- | ---- | ---- | ---- | ---- | ---- |
| Llama-2-13b | 32.57 | 34.37 | 18.61 | 25.14 | 16.91
| Llama-2-13b-chat | 25.11 | 31.13 | 18.29 | 22.45 | 17.51

    year = 2023,
}
```