maljunied committed
Commit ed2272f · 1 Parent(s): 747f6bf

Update README.md

Files changed (1):
  1. README.md +19 -19

README.md CHANGED
@@ -34,11 +34,11 @@ tags:
  <a href="https://arxiv.org/pdf/2312.00738.pdf" target="_blank" rel="noopener">Technical Report</a>
  </p>
 
- We introduce SeaLLMs - a family of language models optimized for Southeast Asian (SEA) languages. The SeaLLM-base models (to be released) were pre-trained from [Llama-2](https://huggingface.co/meta-llama/Llama-2-13b-hf), on a tailored publicly-available dataset, which comprises texts in Vietnamese 🇻🇳, Indonesian 🇮🇩, Thai 🇹🇭, Malay 🇲🇾, Khmer🇰🇭, Lao🇱🇦, Tagalog🇵🇭 and Burmese🇲🇲. The [SeaLLM-chat](https://huggingface.co/spaces/SeaLLMs/SeaLLM-Chat-13b) underwent supervised finetuning (SFT) and specialized self-preferencing DPO using a mix of public instruction data and a small number of queries used by SEA language native speakers in natural settings, which **adapt to the local cultural norms, customs, styles and laws in these areas**.
+ We introduce SeaLLMs - a family of language models optimized for Southeast Asian (SEA) languages. The SeaLLM-base models (to be released) were pre-trained from [Llama-2](https://huggingface.co/meta-llama/Llama-2-13b-hf), on a tailored publicly-available dataset, which comprises texts in Vietnamese 🇻🇳, Indonesian 🇮🇩, Thai 🇹🇭, Malay 🇲🇾, Khmer 🇰🇭, Lao 🇱🇦, Tagalog 🇵🇭 and Burmese 🇲🇲. The [SeaLLM-chat](https://huggingface.co/spaces/SeaLLMs/SeaLLM-Chat-13b) underwent supervised fine-tuning (SFT) and specialized self-preferencing DPO using a mix of public instruction data and a small number of queries used by SEA language native speakers in natural settings, which **adapt to the local cultural norms, customs, styles and laws in these areas**.
 
- SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform **ChatGPT-3.5** in non-Latin languages, such as Thai, Khmer, Lao, and Burmese.
+ SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform **ChatGPT-3.5** in non-Latin languages such as Thai, Khmer, Lao, and Burmese.
 
- - DEMO: [SeaLLMs/SeaLLM-Chat-13b](https://huggingface.co/spaces/SeaLLMs/SeaLLM-Chat-13b) DEMO allows **batch-inference** for evaluation purpose.
+ - DEMO: [SeaLLMs/SeaLLM-Chat-13b](https://huggingface.co/spaces/SeaLLMs/SeaLLM-Chat-13b) DEMO allows **batch-inference** for evaluation purposes.
  - Model weights: To be released.
  - Technical report: [Arxiv: SeaLLMs - Large Language Models for Southeast Asia](https://arxiv.org/pdf/2312.00738.pdf).
 
@@ -48,7 +48,7 @@ By using our released weights, codes, and demos, you agree to and comply with th
  </blockquote>
 
  > **Disclaimer**:
- > We must note that even though the weights, codes, and demos are released in an open manner, similar to other pre-trained language models, and despite our best efforts in red teaming and safety finetuning and enforcement, our models come with potential risks, including but not limited to inaccurate, misleading or potentially harmful generation.
+ > We must note that even though the weights, codes, and demos are released in an open manner, similar to other pre-trained language models, and despite our best efforts in red teaming and safety fine-tuning and enforcement, our models come with potential risks, including but not limited to inaccurate, misleading or potentially harmful generation.
  > Developers and stakeholders should perform their own red teaming and provide related security measures before deployment, and they must abide by and comply with local governance and regulations.
  > In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights, codes, or demos.
 
@@ -64,21 +64,21 @@ The following sections summarize the [performance evaluations](#evaluation) of S
 
  One of the most reliable ways to compare chatbot models is peer comparison.
  With the help of native speakers, we built an instruction test set, called [Sea-bench](https://huggingface.co/datasets/SeaLLMs/Sea-bench) that focuses on various aspects expected in a user-facing chatbot, namely:
- (1) task-solving (e.g. translation & comprehension),
+ (1) task-solving (e.g., translation & comprehension),
  (2) math-reasoning (e.g., math and logical reasoning questions),
  (3) general-instruction (e.g., instructions in general domains),
  (4) natural-questions (e.g., questions about local context often written informally), and
  (5) safety-related questions.
  The test set also covers all languages that we are concerned with.
- Similar to [MT-bench](https://huggingface.co/spaces/lmsys/mt-bench), We use **GPT-4** as an evaluator to rate the comparison between our models versus ChatGPT-3.5 and other baselines.
+ Similar to [MT-bench](https://huggingface.co/spaces/lmsys/mt-bench), we use **GPT-4** as an evaluator to rate the comparison between our models versus ChatGPT-3.5 and other baselines.
 
- We evaluate Sea-bench in 2 mode: Score-based grading (0 to 10) and Peer comparison.
+ We evaluate Sea-bench in 2 modes: Score-based grading (0 to 10) and Peer comparison.
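The two Sea-bench modes described above (score-based grading and peer comparison) reduce to simple aggregations once the judge's verdicts are collected. A minimal sketch with hypothetical judge outputs (the records, scores, and verdicts below are invented for illustration; the actual judging prompts are described in the technical report):

```python
from statistics import mean

# Hypothetical GPT-4 judge outputs, for illustration only.
# Score-based grading: (language, task_category, score on the 0-10 scale).
scores = [
    ("tha", "task-solving", 8.0), ("tha", "math-reasoning", 6.5),
    ("vie", "task-solving", 8.5), ("vie", "math-reasoning", 7.0),
]

def mean_score_by(records, key_index):
    """Average judge scores grouped by language (0) or task category (1)."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[key_index], []).append(rec[2])
    return {key: mean(vals) for key, vals in groups.items()}

# Peer comparison: judge verdicts for model A vs model B on the same prompts.
verdicts = ["win", "win", "tie", "loss", "win"]

def win_rate(vs):
    """Fraction of comparisons where the model is judged equal or better."""
    return sum(v in ("win", "tie") for v in vs) / len(vs)

print(mean_score_by(scores, 0))  # per-language averages
print(win_rate(verdicts))        # 0.8
```

The radar charts aggregate the first mode by task category and language; the head-on ChatGPT comparison later in this README is the second mode.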
 
 ![fig_sea_bench_side_by_side.png](img/fig_sea_bench_side_by_side.png)
 
  As shown in the figure above, as aggregated by task category (left radar chart), our SeaLLM-13b model performs on-par or surpasses ChatGPT-3.5 across many linguistic and writing tasks. This is despite [reported evidence](https://arxiv.org/abs/2309.17012) that GPT-4 evaluator may favor ChatGPT more often than humans do.
 
- Comparing instruction-following capabilities of models in the angle of different SEA languages. As shown, SeaLLM-13b outperforms ChatGPT-3.5 by large margins in most non-Latin languages, such as Burmese (Mya), Lao, Khmer and Thai. In combination with the fact that SeaLLM can encode these languages with up to 9 times fewer tokens, our models are not only superior but also cheaper to operate in these languages than ChatGPT. This helps democratize the benefits of large language models to under-represented and potentially developing communities.
+ When comparing instruction-following capabilities of models from the angle of the different SEA languages, as shown, SeaLLM-13b outperforms ChatGPT-3.5 by large margins in most non-Latin languages, such as Burmese (Mya), Lao, Khmer and Thai. In combination with the fact that SeaLLM can encode these languages with up to 9 times fewer tokens, our models are not only superior but also cheaper to operate in these languages than ChatGPT. This helps democratize the benefits of large language models to under-represented and potentially developing communities.
 
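The cost claim above follows directly from per-token API pricing: a tokenizer that spends k times more tokens on the same text costs roughly k times more per call and eats k times more of a fixed context window. A back-of-the-envelope sketch with placeholder token counts (real numbers would come from running each model's tokenizer on the same sentence):

```python
# Hypothetical token counts for the SAME Khmer sentence under two tokenizers.
# Placeholder values chosen to mirror the "up to 9x fewer tokens" claim;
# real counts require the actual tokenizers.
tokens_latin_bpe = 270   # a Latin-centric BPE fragmenting Khmer script
tokens_sea_vocab = 30    # a vocabulary extended for SEA scripts

def relative_cost(a_tokens, b_tokens):
    """With per-token pricing, relative cost scales linearly with token count."""
    return a_tokens / b_tokens

def fits_in_context(n_tokens, context_window=4096):
    """Whether a prompt of n_tokens fits a fixed context window."""
    return n_tokens <= context_window

print(relative_cost(tokens_latin_bpe, tokens_sea_vocab))  # 9.0
```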
  <div class="row" style="display: flex; clear: both;">
@@ -87,13 +87,13 @@ Comparing instruction-following capabilities of models in the angle of different
  </div>
 
 
- We also compare our model head-on with ChatGPT in peer comparison, as seen above. SeaLLM-13b is equal or better than ChatGPT for up to 40% of the times for Latin-based languages (Eng, Vie, Ind, Msa). In contrast, for non-Latin languages, our SeaLLM-13b surpasses ChatGPT by up to 90%.
+ We also compare our model head-on with ChatGPT in peer comparison, as seen above. SeaLLM-13b is equal or better than ChatGPT for up to 40% of the time for Latin-based languages (Eng, Vie, Ind, Msa). In contrast, for non-Latin languages, SeaLLM-13b surpasses ChatGPT by up to 90%.
 
  ### Safety Enhancement in Local Context
 
- There is growing [evidence](https://arxiv.org/pdf/2310.06474.pdf) that western-built LLMs often neglect safety protection in many lower-resource languages, or even promote contents that may be locally perceived as harmful, inappropriate or illegal by local norms and laws. We take efforts in adapting and safeguarding our SeaLLM models to achieve greater adoption and compliance for the regional audience of Southeast Asia.
+ There is growing [evidence](https://arxiv.org/pdf/2310.06474.pdf) that western-built LLMs often neglect safety protection in many lower-resource languages, or even promote contents that may be locally perceived as harmful, inappropriate or illegal by local norms and laws. We take great effort in adapting and safeguarding our SeaLLM models so as to achieve wider adoption and compliance for the regional audiences of Southeast Asia.
 
- The below dropdown table showcases examples of potentially harmful content that ChatGPT generates whereas our model behaves safer and complies with the regulations.
+ The dropdown table below showcases examples of potentially harmful content that ChatGPT generates, whereas our model behaves safer and complies with the regulations.
 
  <details>
  <summary><span style="color: red">WARNING:</span> The dropdown will display potentially harmful content.</summary>
@@ -112,7 +112,7 @@ The below dropdown table showcases examples of potentially harmful content that
  ### M3Exam - World Knowledge in Regional Languages
 
 
- [M3Exam](https://arxiv.org/pdf/2306.05179.pdf) is a collection of real-life and native official human exam question benchmarks. This benchmark covers questions from multiple countries in the SEA region, which require strong multilingual proficiency and cultural knowledge across various critical educational periods, from primary- to high-school levels of difficulty.
+ [M3Exam](https://arxiv.org/pdf/2306.05179.pdf) is a collection of real-life and native official human exam question benchmark. This benchmark covers questions from multiple countries in the SEA region, which require strong multilingual proficiency and cultural knowledge across various critical educational periods, from primary- to high-school levels of difficulty.
 
  As shown in the table, our SeaLLM model outperforms most 13B baselines and reaches closer to ChatGPT's performance.
  Notably, for Thai - a seemingly low-resource language, our model is just 1% behind ChatGPT despite the large size difference.
@@ -133,7 +133,7 @@ Notably, for Thai - a seemingly low-resource language, our model is just 1% behi
 
  ### MMLU - Preserving English-based knowledge
 
- On the 5-shot [MMLU](https://arxiv.org/abs/2009.03300), our SeaLLM models not only preserve but also slightly outperform 13B LLama-2 and Llama-2-chat, despite the fact that optimizing for this English dominant test set is not part of our goal.
+ On the 5-shot [MMLU](https://arxiv.org/abs/2009.03300), our SeaLLM models not only preserve but also slightly outperforms 13B LLama-2 and Llama-2-chat, despite the fact that optimizing for this English dominant test set is not part of our goal.
 
  | MMLU (Acc) | Average |
  |----------- | ------- |
@@ -148,7 +148,7 @@ On the 5-shot [MMLU](https://arxiv.org/abs/2009.03300), our SeaLLM models not on
 
  ![fig_translate](img/fig_translation.png)
 
- We use the [Flores-200](https://huggingface.co/datasets/facebook/flores) to to test our models ability in machine translation. As shown in above figure, SeaLLM-13B exhibits clear superiority over ChatGPT-3.5 in low-resource languages, such as Lao and Khmer, while maintaining comparable performance with ChatGPT-3.5 in most high-resource languages (e.g., Vietnamese and Indonesian).
+ We use the [Flores-200](https://huggingface.co/datasets/facebook/flores) to test our model's ability in machine translation. As shown in the above figure, SeaLLM-13B exhibits clear superiority over ChatGPT-3.5 in low-resource languages, such as Lao and Khmer, while maintaining comparable performances with ChatGPT-3.5 in most high-resource languages (e.g., Vietnamese and Indonesian).
 
 
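Flores-200 translation outputs are typically scored with automatic metrics such as BLEU or chrF. As an illustration of how character-level scoring works for scripts without whitespace word boundaries, here is a simplified chrF-style character n-gram F-score (not necessarily the exact metric used in the report):

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of a string (whitespace kept, as in chrF)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=3, beta=2.0):
    """Simplified chrF: F-score over character 1..max_n-grams, beta > 1 favors recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(chrf("xin chào thế giới", "xin chào thế giới"), 2))  # identical -> 1.0
```

Character-level metrics avoid penalizing languages such as Lao, Khmer, and Thai for tokenization mismatches, which is one reason chrF is favored for Flores-style multilingual evaluation.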
 
@@ -189,8 +189,8 @@ We conduct pre-training in multiple stages. Each stage serves a different specif
 
  ### Supervised fine-tuning (SFT) Data
 
- Our supervised finetuning (SFT) data consists of many categories. The largest and most dominant of them are public and open-source. As the aforementioned are English only, we employ several established automatic techniques to gather more instruction data for SEA languages through synthetic means. For a small number of SFT data, we engaged native speakers to vet, verify and modify SFT responses so that they adapt to the local cultural customs, norms, and laws.
- We also collect country-relevant safety data that cover many culturally and legally sensitive topics in each of these SEA countries - such data tend to be ignored, or may even appear in conflict with Western safety data. Therefore, we believe that our models are more local-friendly and abide by local rules to a higher degree.
+ Our supervised fine-tuning (SFT) data consists of many categories. The largest and most dominant of them are public and open-source. As these data are available only in English, we employ several established automatic techniques to gather more instruction data for SEA languages through synthetic means. For a small number of SFT data, we engaged native speakers to vet, verify and modify SFT responses so that they adapt to the local cultural customs, norms, and laws.
+ We also collect country-relevant safety data that cover many culturally and legally sensitive topics in each of these SEA countries - such data tend to be ignored, or may even appear in conflict with Western safety data. Therefore, we believe that our models are more native country-friendly and abide by local rules to a higher degree.
 
  ### SFT Strategies
 
@@ -198,9 +198,9 @@ We conduct SFT with a relatively balanced mix of SFT data from different categor
 
  ### Self-preferencing DPO
 
- To save the cost of human preference annotation work, [some](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) have sought to use powerful LLMs like GPT-4 to play as a preference data generator. However, that may not even be feasible for low-resource non-Latin languages because of the unfavorable tokenization of ChatGPT as explained above. In other words, even short prompts would exceed their context-length and the API-call costs would explode by up to 17 times.
+ To save on costs of human preference annotation work, [some](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) have sought to use powerful LLMs like GPT-4 to play as a preference data generator. However, that may not even be feasible for low-resource non-Latin languages because of the unfavorable tokenization of ChatGPT as explained above. In other words, even short prompts would exceed their context-length and the API-call costs would explode by up to 17 times.
 
- Therefore, we use our own SeaLLM SFT models to generate preference data using a special prompting strategy, which we later use to employ direct preference optimization (DPO) to significantly improve the model abilities as an AI agent. As such, our models are free from relying on powerful close-sourced models like GPT-4 to improve the performance in low-resource languages.
+ Therefore, we use our own SeaLLM SFT models to generate preference data using a special prompting strategy, which we later use to employ direct preference optimization (DPO) to significantly improve the models' abilities as an AI agent. As such, our models are free from relying on powerful close-sourced models like GPT-4 to improve the performance on low-resource languages.
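The self-preferencing step above feeds (prompt, chosen, rejected) triples into the standard DPO objective of Rafailov et al. A minimal sketch of that loss on toy log-probabilities (beta and all numbers are illustrative; the special prompting strategy that generates the preference pairs is not shown):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss: -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))).
    Each argument is a log-probability summed over the response tokens."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy already prefers the chosen response a little.
loss = dpo_loss(policy_chosen_logp=-10.0, policy_rejected_logp=-14.0,
                ref_chosen_logp=-12.0, ref_rejected_logp=-12.0)
print(round(loss, 3))  # 0.513
```

The reference-model terms keep the policy close to its SFT initialization, while the sigmoid margin pushes the chosen response's likelihood above the rejected one's - no reward model and no external annotator is needed, which is what makes self-generated preference data workable.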
 
  ## Acknowledgement to Our Linguists
 
@@ -208,7 +208,7 @@ We would like to express our special thanks to our professional and native lingu
 
  ## Citation
 
- If you find our project useful, hope you can star our repo and cite our work as follows. Corresponding Author: [[email protected]](mailto:[email protected])
+ If you find our project useful, we hope you would kindly star our repo and cite our work as follows. Corresponding Author: [[email protected]](mailto:[email protected])
 
 ```
 @article{damonlpsg2023seallm,
 