isakzhang committed on
Commit 531bb38 · 1 Parent(s): a987f1e

Update README.md

Files changed (1)
  1. README.md +19 -12
README.md CHANGED
@@ -8,6 +8,9 @@ language:
 - id
 - th
 - zh
+tags:
+- multilingual
+- sea
 ---
 <p align="center">
 <img src="seal_logo.png" width="200" />
@@ -53,7 +56,7 @@ The following sections summarize the [Pre-training](#pre-training), [Supervised-
 ## Pre-training
 
 ### Vocabulary Expansion
-Like many English/Latin-dominant LLMs, Llama-2's BPE tokenizer breaks non-European and non-Latin linguistic texts into unsustainably long byte-level sequences that cover much shorter semantic meanings, leading to [degraded performance](https://arxiv.org/abs/2306.11372). For instance, it takes 4.3x more tokens to encode the same sentence in Thai compared to that in English. This leads to the models failing to perform summarization and comprehension tasks without exceeding the context length.
+Like many English/Latin-dominant LLMs, Llama-2's BPE tokenizer breaks non-European and non-Latin-script text into unsustainably long byte-level sequences that carry far less semantic content per token, leading to [degraded performance](https://arxiv.org/abs/2306.11372). For instance, encoding the same sentence in Thai takes 4.3x more tokens than in English (see the table below). As a result, the models fail at tasks that require long-context modeling (e.g., summarization and comprehension) without exceeding the context length.
 
 Our goal for vocabulary expansion is threefold: (1) the number of newly added tokens must be minimal and cover only the new languages, (2) the new tokens should bring the compression ratios of the new languages close to that of English, and (3) existing European tokens should be disturbed as little as possible so as to preserve Llama-2's knowledge. In the end, we obtain **~11K** new tokens for Vi, Id, Th, and Zh to augment the original 32,000-token vocabulary. Details of our expansion technique will be given in our upcoming technical report.
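The compression claim above is easy to check empirically. Below is a minimal sketch using the Hugging Face `transformers` tokenizer API; the model ID and the parallel sentences are illustrative placeholders, not the authors' actual measurement setup.

```python
# Sketch: compare per-sentence token counts across languages for the
# Llama-2 tokenizer. Model ID and sentences are illustrative placeholders.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")  # gated repo; any Llama-2 tokenizer works

parallel = {  # roughly parallel sentences (placeholders)
    "en": "The committee approved the new budget yesterday.",
    "vi": "Ủy ban đã phê duyệt ngân sách mới vào ngày hôm qua.",
    "th": "คณะกรรมการอนุมัติงบประมาณใหม่เมื่อวานนี้",
}

en_len = len(tok.encode(parallel["en"], add_special_tokens=False))
for lang, sent in parallel.items():
    n = len(tok.encode(sent, add_special_tokens=False))
    print(f"{lang}: {n:3d} tokens  ({n / en_len:.1f}x English)")
```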
@@ -71,7 +74,7 @@ As seen in the table below, our new vocabulary reduces the compression ratio fro
 ### Pre-training Data
 
 The pre-training dataset of SeaLLMs is formed from documents from diverse public sources, including web texts (e.g., [Common Crawl](https://commoncrawl.org/)),
-news documents (e.g., [CC-News](https://huggingface.co/datasets/cc_news)), academic articles and the texts with expert knowledge (e.g., wikipedia).
+news documents (e.g., [CC-News](https://huggingface.co/datasets/cc_news)), academic articles, and texts with expert knowledge (e.g., Wikipedia).
 We first employ the [FastText language identifier](https://huggingface.co/facebook/fasttext-language-identification) to filter out documents that do not belong to Thai, Vietnamese, or Indonesian.
 To further remove harmful or undesirable content, we develop a pipeline with various data cleaning and filtering modules to preprocess the collected data.
 Meanwhile, to maintain the English performance of SeaLLMs, we also introduce a set of high-quality English texts sampled from [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data) into pre-training.
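The language-identification step can be approximated with the usage shown on the linked model card; the kept-label set and the confidence threshold below are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch: keep only documents identified as Thai, Vietnamese, or Indonesian.
# Label names follow fastText's NLLB-style codes; the threshold is an assumption.
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="facebook/fasttext-language-identification", filename="model.bin"
)
lid = fasttext.load_model(model_path)

KEEP = {"__label__tha_Thai", "__label__vie_Latn", "__label__ind_Latn"}

def keep_document(text: str, threshold: float = 0.5) -> bool:
    labels, probs = lid.predict(text.replace("\n", " "))  # fastText expects one line
    return labels[0] in KEEP and float(probs[0]) >= threshold
```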
@@ -88,7 +91,7 @@ We pre-train our SeaLLM-base in ~4 weeks on 32gpus, clocking ~150B tokens. We us
 
 ### SFT Data
 
-Our supervised finetuning (SFT) data consists of many categories. The largest of them are public and open-source, such as [OpenORCA](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus). As the aforementioned are monolingual, we employ several established or novel automatic techniques to gather more instruction data for SEA languages.
+Our supervised finetuning (SFT) data spans many categories. The largest are public and open-source, such as [OpenORCA](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus). As these datasets are English-only, we employ several established and novel automatic techniques to gather more instruction data for SEA languages.
 
 Even more noteworthy, we engaged native speakers to collect a small number of queries that SEA-language speakers pose in natural settings, which helps the models adapt to local cultural customs, norms, and laws. We also collect country-relevant safety data covering many culturally and legally sensitive topics in each of these SEA countries - such data tend to be ignored, or may even conflict with Western safety data. We therefore believe that our models are more local-friendly and abide by local rules to a higher degree.
 
@@ -103,8 +106,12 @@ We conduct SFT with a relatively balanced mix of SFT data from different categor
 
 One of the most reliable ways to compare chatbot models is peer comparison.
 With the help of native speakers, we built an instruction test set that focuses on various aspects expected of a user-facing chatbot, namely:
-(1) NLP tasks (e.g. translation & comprehension), (2) Reasoning, (3) Instruction-following and
-(4) Natural and Informal questions. The test set also covers all languages that we are concerned with.
+(1) task-solving (e.g., translation & comprehension),
+(2) math-reasoning (e.g., math and logical reasoning questions),
+(3) general-instruction (e.g., instructions in general domains),
+(4) natural-questions (e.g., questions about local context, often written informally), and
+(5) safety-related questions.
+The test set also covers all languages that we are concerned with.
 We use GPT-4 as an evaluator to rate comparisons between our models and ChatGPT-3.5 as well as other baselines.
 
 Compared with [PolyLM-13b-chat](https://arxiv.org/pdf/2307.06018.pdf), a recent multilingual model, our model significantly outperforms it across all languages and categories.
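The exact judging prompt is not given in this README; a generic GPT-4-as-evaluator pairwise sketch, with a hypothetical rubric and prompt wording, might look like the following. In practice one would also swap the A/B order and average the two runs to reduce position bias.

```python
# Sketch: pairwise rating with GPT-4 as judge. The rubric, scale, and prompt
# wording here are assumptions, not the authors' actual evaluation protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """You are a strict evaluator. Given a user question and two
answers, score each answer from 1 to 10 for helpfulness, correctness, and
fluency in the question's language. Reply exactly as: A: <score>, B: <score>

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def judge(question: str, answer_a: str, answer_b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return resp.choices[0].message.content
```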
@@ -117,7 +124,7 @@ Compared with [PolyLM-13b-chat](https://arxiv.org/pdf/2307.06018.pdf), a recent
 
 Compared with Llama-2-13b-chat, our SeaLLM-13b performs significantly better in all SEA languages,
 despite the fact that Llama-2 was already trained on a decent amount of Vi, Id, and Th data.
-In english, our model is 46% as good as Llama-2-13b-chat, even though it did not undergo complex human-labor intensive RLHF.
+In English, our model is 46% as good as Llama-2-13b-chat, even though it did not undergo complex, human-labor-intensive RLHF.
 
 
 
@@ -127,7 +134,7 @@ In english, our model is 46% as good as Llama-2-13b-chat, even though it did not
 </div>
 
 Compared with ChatGPT-3.5, our SeaLLM-13b model performs 45% as well as ChatGPT for Thai.
-For important aspects such as Safety and Task-Solving, our model nearly on par with ChatGPT across the languages.
+For important aspects such as Safety and Task-Solving, our model is nearly on par with ChatGPT across the languages.
 
 
 <div class="row" style="display: flex; clear: both;">
@@ -156,7 +163,7 @@ Notably, for Thai - a seemingly low-resource language, our model is just 1% behi
 
 ### MMLU - Preserving English-based knowledge
 
-On the 5-shot [MMLU](https://arxiv.org/abs/2009.03300), our SeaLLM models not only preserve but also slightly outperform 13B LLama-2 and Llama-2-chat, despite the fact that optimizing for this English and Chinese dominant test set is not part of our goal.
+On the 5-shot [MMLU](https://arxiv.org/abs/2009.03300), our SeaLLM models not only preserve but also slightly outperform 13B Llama-2 and Llama-2-chat, even though optimizing for this English-dominant test set is not part of our goal.
 
 | MMLU (Acc) | STEM | Humanities | Social | Others | Average |
 |------------|------|------------|--------|--------|---------|
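For reference, the conventional 5-shot MMLU prompt prepends five answered dev-set examples to each test question. The README does not specify the authors' exact harness, so this sketch follows the common recipe, assuming the field names of the Hugging Face `cais/mmlu` dataset.

```python
# Sketch: standard 5-shot MMLU prompt construction. Example dicts are assumed
# to carry the `question`, `choices`, and `answer` fields of `cais/mmlu`.
CHOICES = ["A", "B", "C", "D"]

def format_example(q: dict, with_answer: bool = True) -> str:
    s = q["question"] + "\n"
    s += "\n".join(f"{c}. {t}" for c, t in zip(CHOICES, q["choices"]))
    s += "\nAnswer:"
    return s + (f" {CHOICES[q['answer']]}\n\n" if with_answer else "")

def build_prompt(subject: str, dev_examples: list, test_q: dict) -> str:
    header = (f"The following are multiple choice questions (with answers) "
              f"about {subject.replace('_', ' ')}.\n\n")
    shots = "".join(format_example(q) for q in dev_examples[:5])
    return header + shots + format_example(test_q, with_answer=False)
```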
@@ -186,7 +193,7 @@ As shown in the table below, the 1-shot reading comprehension performance is sig
 
 For translation tasks, we evaluate our models on [FloRes-200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md) using [chrF++](https://aclanthology.org/W15-3049/) scores in 4-shot settings.
 
-Similarly observed, our SeaLLM models outperform Llama-2 significantly in the new languages.
+As with the other tasks, our SeaLLM model significantly outperforms Llama-2 in the new languages.
 
 
 | FloRes-200 (chrF++) | En-Zh | En-Vi | En-Id | En-Th | En->X | Zh-En | Vi-En | Id-En | Th-En | X->En |
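chrF++ scores like these can be computed with sacreBLEU; setting `word_order=2` is what turns plain chrF into chrF++. The hypothesis and reference strings below are placeholders.

```python
# Sketch: corpus-level chrF++ with sacreBLEU.
from sacrebleu.metrics import CHRF

chrf_pp = CHRF(word_order=2)  # chrF plus word bigrams = chrF++

hyps = ["Xin chào thế giới."]   # system translations (placeholders)
refs = [["Chào thế giới."]]     # one reference stream, aligned to hyps
print(chrf_pp.corpus_score(hyps, refs))  # e.g. "chrF2++ = ..."
```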
@@ -204,7 +211,7 @@ Our models are also performing competitively with ChatGPT for translation betwee
 
 #### Summarization
 
-Lastly, in 2-shot [XL-sum summarization tasks](https://aclanthology.org/2021.findings-acl.413/), our models also achieve a better performance, with substantial gains in Thai.
+Lastly, on the 2-shot [XL-Sum summarization task](https://aclanthology.org/2021.findings-acl.413/), our model also achieves better performance, with substantial gains in Thai.
 
 | XL-Sum (ROUGE-L) | En | Zh | Vi | Id | Th |
 |------------------|----|----|----|----|----|
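ROUGE-L can be computed with the `rouge-score` package; note that XL-Sum applies language-specific tokenization for non-whitespace languages such as Thai and Chinese, which this minimal, whitespace-based sketch omits. Texts are placeholders.

```python
# Sketch: sentence-level ROUGE-L F1. Real XL-Sum scoring tokenizes
# non-whitespace languages before computing ROUGE; this sketch does not.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
scores = scorer.score(
    target="the cabinet approved the new budget on tuesday",  # reference
    prediction="the new budget was approved on tuesday",      # model output
)
print(scores["rougeL"].fmeasure)
```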
@@ -214,7 +221,7 @@ Lastly, in 2-shot [XL-sum summarization tasks](https://aclanthology.org/2021.fin
 
 ## Acknowledge our linguists
 
-We would like to express our special thanks to our professional and native linguists, who helped build, evaluate and fact-check our supervised finetuning (SFT) dataset as well as evaluating our models across different aspects, especially safety.
+We would like to express our special thanks to our professional native-speaker linguists, who helped build, evaluate, and fact-check our sampled pre-training and SFT datasets, and who evaluated our models across different aspects, especially safety.
 
 ## Citation
 
@@ -228,4 +235,4 @@ If you find our project useful, hope you can star our repo and cite our work as
 title = {SeaLLMs - Large Language Models for Southeast Asia},
 year = 2023,
 }
-```
+```
 