nxphi47 committed on
Commit 9caf7f3 · 1 Parent(s): 7cf5394

Update README.md

  1. README.md +6 -10
README.md CHANGED
@@ -6,24 +6,20 @@ inference: false
6
  <img src="seal_logo.png" width="200" />
7
  </p>
8
 
9
- # SeaLLM - An Assistant for South East Asian Languages
10
 
11
 
12
- <!-- - DEMO: [DAMO-NLP-SG/damo-seal-v0](https://huggingface.co/spaces/DAMO-NLP-SG/damo-seal-v0) -->
13
-
14
  <p align="center">
15
- 🤗 <a href="https://huggingface.co/spaces/DAMO-NLP-SG/damo-seal-v0">Hugging Face DEMO</a>
16
  </p>
17
 
18
- We introduce SeaLLM - a family of language models optimized for South East Asian (SEA) languages. The SeaLLM-base models (to be released) were pre-trained from [Llama-2](https://huggingface.co/meta-llama/Llama-2-13b-hf), on a tailored publicly-available dataset, which comprises mainly Vietnamese 🇻🇳, Indonesian 🇮🇩 and Thai 🇹🇭 texts, along with those in English 🇬🇧 and Chinese 🇨🇳. The pre-training stage involves multiple stages with dynamic data control to preserve the original knowledge base of Llama-2 while gaining new abilities in SEA languages.
19
 
20
- The [SeaLLM-chat](https://huggingface.co/spaces/DAMO-NLP-SG/damo-seal-v0) model underwent supervised finetuning (SFT) on a mix of public instruction data (e.g. [OpenORCA](https://huggingface.co/datasets/Open-Orca/OpenOrca)) and a small internally-collected amount of natural queries from SEA native speakers, which **adapt to the local cultural norms, customs, styles and laws in these regions**, as well as other SFT enhancement techniques (to be revealed later).
21
 
22
  Our customized SFT process helps enhance our models' ability to understand, respond and serve communities whose languages are often neglected by previous [English-dominant LLMs](https://arxiv.org/abs/2307.09288), while outperforming existing polyglot LLMs, like [BLOOM](https://arxiv.org/abs/2211.05100) or [PolyLM](https://arxiv.org/pdf/2307.06018.pdf).
23
 
24
- Our [first released SeaLLM](https://huggingface.co/spaces/DAMO-NLP-SG/damo-seal-v0) supports Vietnamese 🇻🇳, Indonesian 🇮🇩 and Thai 🇹🇭. Future versions endeavor to cover all languages spoken in South East Asia.
25
-
26
- <!-- - Model links: [DAMO-NLP-SG/seal-13b-chat-a](https://huggingface.co/DAMO-NLP-SG/seal-13b-chat-a) -->
27
 
28
 
29
  <blockquote style="color:red">
@@ -204,7 +200,7 @@ If you find our project useful, hope you can star our repo and cite our work as
204
  ```
205
  @article{damonlpsg2023seallm,
206
  author = {???},
207
- title = {SeaLLM: A language model for South East Asian Languages},
208
  year = 2023,
209
  }
210
  ```
 
6
  <img src="seal_logo.png" width="200" />
7
  </p>
8
 
9
+ # SeaLLMs - Large Language Models for Southeast Asia
10
 
11
 
 
 
12
  <p align="center">
13
+ 🤗 <a href="https://huggingface.co/spaces/SeaLLMs/SeaLLM-chat-13b-demo">Hugging Face DEMO</a>
14
  </p>
15
 
16
+ We introduce SeaLLM - a family of language models optimized for Southeast Asian (SEA) languages. The SeaLLM-base models (to be released) were pre-trained from [Llama-2](https://huggingface.co/meta-llama/Llama-2-13b-hf), on a tailored publicly-available dataset, which comprises mainly Vietnamese 🇻🇳, Indonesian 🇮🇩 and Thai 🇹🇭 texts, along with those in English 🇬🇧 and Chinese 🇨🇳. The pre-training stage involves multiple stages with dynamic data control to preserve the original knowledge base of Llama-2 while gaining new abilities in SEA languages.
17
 
18
+ The [SeaLLM-chat](https://huggingface.co/spaces/SeaLLMs/SeaLLM-chat-13b-demo) model underwent supervised finetuning (SFT) on a mix of public instruction data (e.g. [OpenORCA](https://huggingface.co/datasets/Open-Orca/OpenOrca)) and a small internally-collected amount of natural queries from SEA native speakers, which **adapt to the local cultural norms, customs, styles and laws in these regions**, as well as other SFT enhancement techniques (to be revealed later).
19
 
20
  Our customized SFT process helps enhance our models' ability to understand, respond and serve communities whose languages are often neglected by previous [English-dominant LLMs](https://arxiv.org/abs/2307.09288), while outperforming existing polyglot LLMs, like [BLOOM](https://arxiv.org/abs/2211.05100) or [PolyLM](https://arxiv.org/pdf/2307.06018.pdf).
21
 
22
+ Our [first released SeaLLM](https://huggingface.co/spaces/SeaLLMs/SeaLLM-chat-13b-demo) supports Vietnamese 🇻🇳, Indonesian 🇮🇩 and Thai 🇹🇭. Future versions endeavor to cover all languages spoken in Southeast Asia.
 
 
23
 
24
 
25
  <blockquote style="color:red">
 
200
  ```
201
  @article{damonlpsg2023seallm,
202
  author = {???},
203
+ title = {SeaLLMs - Large Language Models for Southeast Asia},
204
  year = 2023,
205
  }
206
  ```