Update README.md
README.md
CHANGED
@@ -8,6 +8,9 @@ language:
 - id
 - th
 - zh
+tags:
+- multilingual
+- sea
 ---
 <p align="center">
   <img src="seal_logo.png" width="200" />
@@ -53,7 +56,7 @@ The following sections summarize the [Pre-training](#pre-training), [Supervised-
 ## Pre-training

 ### Vocabulary Expansion
-Like many English/Latin-dominant LLMs, Llama-2's BPE tokenizer breaks non-European and non-Latin linguistic texts into unsustainably long byte-level sequences that cover much shorter semantic meanings, leading to [degraded performance](https://arxiv.org/abs/2306.11372). For instance, it takes 4.3x more tokens to encode the same sentence in Thai compared to that in English. This leads to the models failing to perform summarization and comprehension tasks without exceeding the context length.
+Like many English/Latin-dominant LLMs, Llama-2's BPE tokenizer breaks non-European and non-Latin texts into unsustainably long byte-level sequences that cover much shorter semantic meanings, leading to [degraded performance](https://arxiv.org/abs/2306.11372). For instance, it takes 4.3x more tokens to encode the same sentence in Thai than in English (see the table below). This leads to the models failing to perform tasks that require long-context modeling (e.g., summarization and comprehension) without exceeding the context length.

 Our goal for vocabulary expansion is threefold: (1) the number of newly-added tokens must be minimal and only cover the new languages, (2) the tokens should bring the compression ratios of new languages close to that of English, and (3) minimize the disruption of existing European tokens to preserve Llama-2 knowledge. In the end, we obtain **~11K** new tokens for Vi, Id, Th, and Zh to augment the original 32000-token vocabulary. Details of our expansion technique will be revealed in our upcoming technical report.

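For reference, the token-fragmentation effect described in the updated paragraph can be checked directly against the Llama-2 tokenizer. This is a minimal sketch, assuming access to the gated `meta-llama/Llama-2-7b-hf` repository via `transformers`; the English/Thai sentence pair is an illustrative placeholder rather than text from the SeaLLMs corpus, so the exact ratio will differ from the 4.3x figure quoted above.

```python
# Compare how many Llama-2 BPE tokens an English sentence and a rough Thai
# equivalent consume. Assumes `pip install transformers sentencepiece` and
# access to the gated meta-llama/Llama-2-7b-hf checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

english = "The weather is very nice today, so we plan to have a picnic in the park."
thai = "วันนี้อากาศดีมาก เราจึงวางแผนจะไปปิกนิกที่สวนสาธารณะ"  # rough Thai rendering of the same sentence

en_len = len(tokenizer.encode(english, add_special_tokens=False))
th_len = len(tokenizer.encode(thai, add_special_tokens=False))

# Thai has no whitespace word boundaries and largely falls back to byte-level
# pieces, so th_len is typically several times larger than en_len.
print(f"English tokens: {en_len}")
print(f"Thai tokens:    {th_len}")
print(f"Thai / English ratio: {th_len / en_len:.1f}x")
```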
@@ -71,7 +74,7 @@ As seen in the table below, our new vocabulary reduces the compression ratio fro
 ### Pre-training Data

 The pre-training dataset of SeaLLMs is formed from documents of diverse public sources, including web texts (e.g., [Common Crawl](https://commoncrawl.org/)),
-news documents (e.g., [CC-News](https://huggingface.co/datasets/cc_news)), academic articles and
+news documents (e.g., [CC-News](https://huggingface.co/datasets/cc_news)), academic articles, and texts with expert knowledge (e.g., Wikipedia).
 We first employ the [FastText language identifier](https://huggingface.co/facebook/fasttext-language-identification) to filter out documents that do not belong to Thai, Vietnamese, or Indonesian.
 To further remove harmful or undesirable content, we develop a pipeline with various data cleaning and filtering modules to preprocess the collected data.
 Meanwhile, to maintain the English performance of SeaLLMs, we also introduce a set of high-quality English texts sampled from [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data) into pre-training.
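The FastText-based filtering step mentioned above can be sketched with the linked model as follows. This assumes the `fasttext` and `huggingface_hub` packages; the example documents, the confidence threshold, and the exact label strings (which follow the model's `__label__<lang>_<Script>` convention) are illustrative assumptions, not values taken from the SeaLLMs pipeline.

```python
# Keep only documents whose predicted language is Thai, Vietnamese, or Indonesian.
# Assumes `pip install fasttext huggingface_hub`.
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="facebook/fasttext-language-identification", filename="model.bin"
)
lid_model = fasttext.load_model(model_path)

# Illustrative label set and threshold (not the project's actual settings).
KEEP_LABELS = {"__label__tha_Thai", "__label__vie_Latn", "__label__ind_Latn"}
MIN_CONFIDENCE = 0.5

documents = [
    "ข่าวเศรษฐกิจวันนี้มีหลายประเด็นที่น่าสนใจ",           # Thai
    "Hôm nay trời đẹp nên chúng tôi đi dạo công viên.",     # Vietnamese
    "This sentence is in English and would be filtered out.",
]

kept = []
for doc in documents:
    # fastText predicts on a single line of text, so newlines must be stripped first.
    labels, probs = lid_model.predict(doc.replace("\n", " "), k=1)
    if labels[0] in KEEP_LABELS and probs[0] >= MIN_CONFIDENCE:
        kept.append(doc)

print(f"Kept {len(kept)} of {len(documents)} documents")
```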
@@ -88,7 +91,7 @@ We pre-train our SeaLLM-base in ~4 weeks on 32gpus, clocking ~150B tokens. We us

 ### SFT Data

-Our supervised finetuning (SFT) data consists of many categories. The largest of them are public and open-source, such as [OpenORCA](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus). As the aforementioned are
+Our supervised finetuning (SFT) data consists of many categories. The largest of them are public and open-source, such as [OpenORCA](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus). As the aforementioned are English only, we employ several established or novel automatic techniques to gather more instruction data for SEA languages.

 Even more noteworthy is that we engaged native speakers to collect a small number of queries used by SEA-language native speakers in natural settings, which helps in adaptation to the local cultural customs, norms, and laws. We also collect country-relevant safety data that cover many culturally and legally sensitive topics in each of these SEA countries - such data tend to be ignored, or may even appear in conflict with Western safety data. Therefore, we believe that our models are more local-friendly and abide by local rules to a higher degree.

@@ -103,8 +106,12 @@ We conduct SFT with a relatively balanced mix of SFT data from different categor

 One of the most reliable ways to compare chatbot models is peer comparison.
 With the help of native speakers, we built an instruction test set that focuses on various aspects expected in a user-facing chatbot, namely:
-(1)
-(
+(1) task-solving (e.g., translation & comprehension),
+(2) math-reasoning (e.g., math and logical reasoning questions),
+(3) general-instruction (e.g., instructions in general domains),
+(4) natural-questions (e.g., questions about local context, often written informally), and
+(5) safety-related questions.
+The test set also covers all languages that we are concerned with.
 We use GPT-4 as an evaluator to rate the comparisons between our models and ChatGPT-3.5 and other baselines.

 Compared with [PolyLM-13b-chat](https://arxiv.org/pdf/2307.06018.pdf), a recent multilingual model, our model significantly outperforms it across all languages and categories.
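The GPT-4-as-evaluator protocol described above amounts to a pairwise judging prompt. The sketch below assumes the `openai>=1.0` Python client with an `OPENAI_API_KEY` set; the prompt wording, the A/B/tie verdict format, and the example texts are illustrative assumptions, not the rubric actually used for SeaLLMs.

```python
# Pairwise peer comparison: ask GPT-4 which of two chatbot answers is better.
# Assumes `pip install openai` (v1 client) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        "You are judging two chatbot answers to the same user question.\n\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Compare them on helpfulness, correctness, and safety. "
        "Reply with 'A', 'B', or 'tie', then give a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Example: an informal Vietnamese "natural question" with two candidate answers.
print(judge_pair(
    "Cuối tuần này ở Hà Nội nên đi chơi đâu?",
    "Bạn có thể dạo hồ Hoàn Kiếm, thăm Văn Miếu hoặc đi chợ đêm phố cổ.",
    "I don't know.",
))
```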
@@ -117,7 +124,7 @@ Compared with [PolyLM-13b-chat](https://arxiv.org/pdf/2307.06018.pdf), a recent

 Compared with Llama-2-13b-chat, our SeaLLM-13b performs significantly better in all SEA languages,
 despite the fact that Llama-2 was already trained on a decent amount of Vi, Id, and Th data.
-In
+In English, our model is 46% as good as Llama-2-13b-chat, even though it did not undergo complex, human-labor-intensive RLHF.



@@ -127,7 +134,7 @@ In english, our model is 46% as good as Llama-2-13b-chat, even though it did not
 </div>

 Compared with ChatGPT-3.5, our SeaLLM-13b model performs 45% as well as ChatGPT for Thai.
-For important aspects such as Safety and Task-Solving, our model nearly on par with ChatGPT across the languages.
+For important aspects such as Safety and Task-Solving, our model is nearly on par with ChatGPT across the languages.


 <div class="row" style="display: flex; clear: both;">
@@ -156,7 +163,7 @@ Notably, for Thai - a seemingly low-resource language, our model is just 1% behi

 ### MMLU - Preserving English-based knowledge

-On the 5-shot [MMLU](https://arxiv.org/abs/2009.03300), our SeaLLM models not only preserve but also slightly outperform 13B LLama-2 and Llama-2-chat, despite the fact that optimizing for this English
+On the 5-shot [MMLU](https://arxiv.org/abs/2009.03300), our SeaLLM models not only preserve but also slightly outperform 13B Llama-2 and Llama-2-chat, despite the fact that optimizing for this English-dominant test set is not part of our goal.

 | MMLU (Acc) | STEM | Humanities | Social | Others | Average |
 |-----------| ------- | ------- | ------- | ------- | ------- |
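For context on the MMLU numbers, 5-shot evaluation builds each test prompt from the five `dev` examples of a subject and scores the model's A/B/C/D choice. Below is a minimal prompt-construction sketch, assuming the `datasets` package and the `cais/mmlu` dataset layout on the Hub (fields `question`, `choices`, `answer`); the prompt template is a common convention, not necessarily the exact one used for SeaLLMs.

```python
# Build a 5-shot MMLU prompt for one subject. Assumes `pip install datasets`.
from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def format_example(example, include_answer=True):
    text = example["question"] + "\n"
    for letter, choice in zip(LETTERS, example["choices"]):
        text += f"{letter}. {choice}\n"
    text += "Answer:"
    if include_answer:
        text += f" {LETTERS[example['answer']]}\n\n"
    return text

subject = "high_school_geography"                         # one of the 57 MMLU subjects
dev = load_dataset("cais/mmlu", subject, split="dev")     # 5 few-shot examples
test = load_dataset("cais/mmlu", subject, split="test")

few_shot = "".join(format_example(ex) for ex in dev)
prompt = (
    f"The following are multiple choice questions (with answers) about "
    f"{subject.replace('_', ' ')}.\n\n" + few_shot
    + format_example(test[0], include_answer=False)
)
# Accuracy is the fraction of test questions where the model's next token
# (among A/B/C/D) matches the gold answer.
print(prompt)
```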
@@ -186,7 +193,7 @@ As shown in the table below, the 1-shot reading comprehension performance is sig

 For translation tasks, we evaluate our models with the [FloRes-200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md) using [chrF++](https://aclanthology.org/W15-3049/) scores in 4-shot settings.

-Similarly observed, our SeaLLM
+As with the previous tasks, our SeaLLM model significantly outperforms Llama-2 in the new languages.


 | FloRes-200 (chrF++) | En-Zh | En-Vi | En-Id | En-Th | En->X | Zh-En | Vi-En | Id-En | Th-En | X->En |
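The chrF++ metric referenced above is available in `sacrebleu` as chrF with word bigrams enabled. A minimal scoring sketch, assuming the `sacrebleu` package; the Vietnamese hypothesis/reference pair is a made-up placeholder rather than FloRes-200 data.

```python
# Corpus-level chrF++ scoring. Assumes `pip install sacrebleu`.
from sacrebleu.metrics import CHRF

chrfpp = CHRF(word_order=2)  # word_order=2 is what turns chrF into chrF++

hypotheses = ["Chúng tôi sẽ họp vào sáng mai."]
# One reference stream, aligned index-by-index with the hypotheses list.
references = [["Chúng tôi sẽ có cuộc họp vào sáng ngày mai."]]

result = chrfpp.corpus_score(hypotheses, references)
print(result)  # e.g. "chrF2++ = ..." (the value depends on the texts)
```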
@@ -204,7 +211,7 @@ Our models are also performing competitively with ChatGPT for translation betwee

 #### Summarization

-Lastly, in 2-shot [XL-sum summarization tasks](https://aclanthology.org/2021.findings-acl.413/), our
+Lastly, on the 2-shot [XL-Sum summarization tasks](https://aclanthology.org/2021.findings-acl.413/), our model also achieves better performance, with substantial gains in Thai.

 | XL-Sum (rouge-L) | En | Zh | Vi | Id | Th |
 |-------- | ---- | ---- | ---- | ---- | ---- |
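The XL-Sum numbers above are ROUGE-L scores. A minimal scoring sketch, assuming the `rouge-score` package and an illustrative reference/prediction pair; note that multilingual ROUGE variants with language-specific tokenization are typically used for non-English languages such as Thai, which this default English-oriented scorer does not apply.

```python
# Single-pair ROUGE-L scoring. Assumes `pip install rouge-score`.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

reference = "The central bank raised interest rates by 25 basis points on Tuesday."
prediction = "On Tuesday the central bank increased rates by a quarter point."

scores = scorer.score(reference, prediction)
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```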
@@ -214,7 +221,7 @@ Lastly, in 2-shot [XL-sum summarization tasks](https://aclanthology.org/2021.fin

 ## Acknowledge our linguists

-We would like to express our special thanks to our professional and native linguists, who helped build, evaluate and fact-check our
+We would like to express our special thanks to our professional and native linguists, who helped build, evaluate, and fact-check our sampled pretraining and SFT datasets, as well as evaluate our models across different aspects, especially safety.

 ## Citation

@@ -228,4 +235,4 @@ If you find our project useful, hope you can star our repo and cite our work as
   title = {SeaLLMs - Large Language Models for Southeast Asia},
   year = 2023,
 }
-```
+```