---
language:
- ms
- en
- zh
- ta
---
|
|
|
# Malaysian Llama 3.1 70B Instruct
|
|
|
Continued finetuning of https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct on a highly curated 1.5B-token Malaysian instruction dataset.
|
|
|
## Improvements
|
|
|
1. Supports responding in Mandarin, Tamil, Jawi, Manglish, and local dialects from Johor, Kedah, Kelantan, Pahang, Perak, Sabah, Sarawak, Selangor, Negeri Sembilan and Terengganu (see the usage sketch after this list).
|
2. Able to code when prompted in Mandarin, Tamil, Jawi, Manglish and the same local dialects.
|
3. Handles multi-turn conversations in a Malaysian context, such as Malaysian legislation, politics, religions and languages.
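
A minimal usage sketch with the Hugging Face `transformers` chat-template API is shown below; the repository name, dtype and sampling settings are placeholders and assumptions, not settings confirmed by this card.

```python
# Minimal usage sketch, assuming the standard transformers chat-template API.
# The model ID below is a hypothetical placeholder; use this repository's actual ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mesolitica/Malaysian-Llama-3.1-70B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    # "can you explain what LoRA is, in the Kelantan dialect?"
    {"role": "user", "content": "boleh terangkan apa itu LoRA dalam loghat Kelantan?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```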
|
|
|
## Training session
|
|
|
Finetuned on [mesolitica/Malaysian-SFT](https://huggingface.co/datasets/mesolitica/Malaysian-SFT) to make the model understand Malaysian context.
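
For reference, the SFT mixture can be inspected with the `datasets` library; the split name below is an assumption, and the dataset may expose several configurations or data files.

```python
# Minimal sketch for inspecting the SFT mixture; adjust config/split as needed.
from datasets import load_dataset

ds = load_dataset("mesolitica/Malaysian-SFT", split="train")
print(ds)     # features and row count
print(ds[0])  # one instruction example
```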
|
|
|
## How we train
|
|
|
1. LoRA on `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"]` (see the `LoraConfig` sketch after this list).
|
2. Rank 128 with alpha 256, i.e. an alpha-to-rank ratio of 2.0.
|
3. Multipacking at 8192 context length with proper SDPA causal masking to prevent cross-document contamination and to keep correct per-document position IDs (see the packing sketch after this list).
|
4. Chunked Cut Cross-Entropy (CCE) loss for LoRA (see the chunked-loss sketch after this list).
|
5. WandB logs at https://wandb.ai/huseinzol05/lora-embedding-128-llama3.1-70b-malaysian-8k?nw=nwuserhuseinzol05
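
Items 1 and 2 roughly correspond to a PEFT `LoraConfig` like the sketch below; it mirrors only the rank, alpha and target modules stated above, while the remaining hyperparameters are assumptions (the exact configuration is in the source code linked below).

```python
# Sketch of the LoRA setup in items 1-2, using peft. Only rank, alpha and
# target modules come from this card; everything else is an assumption.
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,            # rank 128
    lora_alpha=256,   # alpha 256 -> alpha/rank ratio of 2.0
    lora_dropout=0.0, # assumption, not stated above
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",
    ],
    task_type="CAUSAL_LM",
)
```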
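
Item 3 amounts to concatenating several documents into a single 8192-token row while keeping a block-diagonal causal mask and per-document position IDs, so tokens never attend across document boundaries. A minimal PyTorch sketch of that idea (not the actual training code):

```python
# Sketch of item 3: pack documents into one row, restart position IDs per
# document, and build a boolean block-diagonal causal mask in the SDPA shape
# (batch, heads, q_len, kv_len) where True means "may attend".
import torch

def pack(documents, max_len=8192):
    input_ids, position_ids, doc_ids = [], [], []
    for doc_idx, doc in enumerate(documents):
        doc = doc[: max_len - len(input_ids)]
        input_ids.extend(doc)
        position_ids.extend(range(len(doc)))     # positions restart per document
        doc_ids.extend([doc_idx] * len(doc))
        if len(input_ids) >= max_len:
            break
    doc_ids = torch.tensor(doc_ids)
    causal = torch.tril(torch.ones(len(doc_ids), len(doc_ids), dtype=torch.bool))
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    attention_mask = (causal & same_doc)[None, None]  # (1, 1, seq, seq)
    return (
        torch.tensor(input_ids)[None],
        torch.tensor(position_ids)[None],
        attention_mask,
    )
```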
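
Item 4 is about never materialising the full (sequence length × vocabulary) logits tensor at once. The actual run uses a Cut Cross-Entropy implementation; the sketch below only re-creates the chunking idea in plain PyTorch.

```python
# Illustrative sketch of item 4: compute lm_head logits and cross-entropy over
# sequence chunks so only a (chunk, vocab) slice of logits exists at a time.
import torch
import torch.nn.functional as F

def chunked_ce_loss(hidden, lm_head_weight, labels, chunk_size=1024, ignore_index=-100):
    # hidden: (seq_len, hidden_dim); labels: (seq_len,), already shifted
    total_loss, total_tokens = hidden.new_zeros(()), 0
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]
        y = labels[start:start + chunk_size]
        logits = h @ lm_head_weight.T            # (chunk, vocab) only
        mask = y != ignore_index
        if mask.any():
            total_loss = total_loss + F.cross_entropy(
                logits[mask].float(), y[mask], reduction="sum"
            )
            total_tokens += int(mask.sum())
    return total_loss / max(total_tokens, 1)
```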
|
|
|
Source code at https://github.com/mesolitica/malaya/tree/master/session/llama3
|
|
|
## Acknowledgement
|
|
|
Special thanks to https://www.sns.com.my for the 8x H100 node!