---
language:
- ms
- en
- zh
- ta
---

# Malaysian Llama-3.2-3B-Instruct

Continued finetuning of https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct on a highly curated 1.5B-token Malaysian instruction dataset.

## Improvements

1. Supports responding in Mandarin, Tamil, Jawi, Manglish, and the local dialects of Johor, Kedah, Kelantan, Pahang, Perak, Sabah, Sarawak, Selangor, Negeri Sembilan, and Terengganu; see the usage sketch after this list.
2. Able to code when prompted in Mandarin, Tamil, Jawi, Manglish, or any of the dialects above.
3. Handles multi-turn conversations on Malaysian context, such as Malaysian legislation, politics, religion, and languages.
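
A minimal chat sketch with `transformers` (the repo id below is an assumed placeholder; substitute the actual published model id):

```python
# Minimal chat sketch; the repo id is an assumed placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mesolitica/Malaysian-Llama-3.2-3B-Instruct"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Prompt in Manglish / a local dialect, as listed above.
messages = [{"role": "user", "content": "camne nak selesaikan jem kat KL ni?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```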

## Training session

Finetuned on [mesolitica/Malaysian-SFT](https://huggingface.co/datasets/mesolitica/Malaysian-SFT) to make the model understand Malaysian context.

## How we train

1. LoRA on `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "embed_tokens", "lm_head"]`; see the config sketch after this list.
2. Rank 128 with alpha 256, i.e. an alpha-to-rank scaling factor of 2.0.
3. Multipacking at 8192 context length with proper SDPA causal masking to prevent cross-document contamination, and with correct position ids per packed document; see the packing sketch below.
4. Chunked CCE loss for LoRA; see the chunked-loss sketch below.
5. WandB at https://wandb.ai/huseinzol05/lora-embedding-128-llama3.2-3b-malaysian-8k?nw=nwuserhuseinzol05
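
A sketch of the corresponding LoRA setup with `peft`, matching points 1 and 2 (everything besides the target modules, rank, and alpha is illustrative, not the exact training configuration):

```python
# Illustrative peft LoRA config: the listed target modules, rank 128, alpha 256.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct", torch_dtype="bfloat16"
)
config = LoraConfig(
    r=128,           # rank
    lora_alpha=256,  # alpha / rank = 2.0 scaling
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```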

Source code at https://github.com/mesolitica/malaya/tree/master/session/llama3
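
For point 3, a sketch of how multipacking can assign per-document position ids and a block-diagonal causal mask for SDPA; the `pack` helper is hypothetical, illustrating the idea rather than the actual training code:

```python
# Illustrative multipacking helper: position ids restart per document, and the
# boolean SDPA mask only allows causal attention within the same document.
import torch

def pack(docs: list, pad_id: int = 0, length: int = 8192):
    input_ids, position_ids, doc_ids = [], [], []
    for i, doc in enumerate(docs):
        input_ids.extend(doc)
        position_ids.extend(range(len(doc)))  # positions restart per document
        doc_ids.extend([i] * len(doc))
    pad = length - len(input_ids)
    input_ids += [pad_id] * pad
    position_ids += [0] * pad
    doc_ids += [-1] * pad  # padding belongs to no document

    ids = torch.tensor(doc_ids)
    same_doc = (ids[:, None] == ids[None, :]) & (ids[:, None] != -1)
    causal = torch.tril(torch.ones(length, length, dtype=torch.bool))
    attention_mask = same_doc & causal  # block-diagonal causal mask for SDPA
    return (
        torch.tensor(input_ids)[None],     # [1, length]
        torch.tensor(position_ids)[None],  # [1, length]
        attention_mask[None, None],        # [1, 1, length, length]
    )
```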
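
For point 4, assuming CCE here stands for Cut Cross-Entropy, the key idea is never materializing the full `[tokens, vocab]` logit matrix at once. A generic pure-PyTorch chunked cross-entropy that captures the memory saving (the real CCE kernel is considerably more efficient):

```python
# Generic chunked cross-entropy: project to the vocabulary chunk-by-chunk so
# only a [chunk, vocab] slice of logits is alive at any time.
import torch
import torch.nn.functional as F

def chunked_cross_entropy(hidden, lm_head_weight, labels, chunk_size=1024):
    # hidden: [tokens, dim], lm_head_weight: [vocab, dim], labels: [tokens]
    total = hidden.new_zeros(())
    count = 0
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]
        y = labels[start:start + chunk_size]
        logits = h @ lm_head_weight.T  # [chunk, vocab]
        total = total + F.cross_entropy(
            logits.float(), y, ignore_index=-100, reduction="sum"
        )
        count += int((y != -100).sum())
    return total / count
```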

## Benchmark

### MalayMMLU

#### Probability of next tokens

Based on the official MalayMMLU 0-shot first-token accuracy,

```
                             Model   Accuracy   shot by_letter        category
0  Malaysian-Llama-3.2-3B-Instruct  57.634056  0shot      True            STEM
1  Malaysian-Llama-3.2-3B-Instruct  59.351145  0shot      True        Language
2  Malaysian-Llama-3.2-3B-Instruct  57.559988  0shot      True  Social science
3  Malaysian-Llama-3.2-3B-Instruct  57.303910  0shot      True          Others
4  Malaysian-Llama-3.2-3B-Instruct  60.022753  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : Malaysian-Llama-3.2-3B-Instruct
Metric : first
Shot : 0shot
average accuracy 58.43555115020857
accuracy for STEM 57.63405648792468
accuracy for Language 59.35114503816794
accuracy for Social science 57.55998843596415
accuracy for Others 57.30390981050611
accuracy for Humanities 60.02275312855517
```

The original model, for comparison:

```
                   Model   Accuracy   shot by_letter        category
0  Llama-3.2-3B-Instruct  56.733524  0shot      True            STEM
1  Llama-3.2-3B-Instruct  58.460560  0shot      True        Language
2  Llama-3.2-3B-Instruct  54.206418  0shot      True  Social science
3  Llama-3.2-3B-Instruct  52.554569  0shot      True          Others
4  Llama-3.2-3B-Instruct  60.659841  0shot      True      Humanities
{'Social science': 6918, 'Language': 6288, 'Humanities': 4395, 'Others': 4169, 'STEM': 2443}
Model : Llama-3.2-3B-Instruct
Metric : first
Shot : 0shot
average accuracy 56.453145004749516
accuracy for STEM 56.73352435530086
accuracy for Language 58.460559796437664
accuracy for Social science 54.20641803989592
accuracy for Others 52.554569441112974
accuracy for Humanities 60.659840728100114
```
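
First-token accuracy scores each question by the next-token probability the model assigns to each option letter; a minimal sketch of the idea (simplified prompt handling, not the official MalayMMLU harness):

```python
# Minimal first-token probability scoring: pick the option letter whose first
# token gets the highest next-token probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

def predict_letter(prompt: str, letters=("A", "B", "C", "D")) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    ids = [tokenizer.encode(l, add_special_tokens=False)[0] for l in letters]
    return letters[int(torch.argmax(logits[ids]))]
```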

#### First token match using vLLM

Based on 0-shot exact first-token match using vLLM,

```
                             Model   Accuracy  shot        category
0  Malaysian-Llama-3.2-3B-Instruct  51.944331     0            STEM
1  Malaysian-Llama-3.2-3B-Instruct  50.795165     0        Language
2  Malaysian-Llama-3.2-3B-Instruct  52.732003     0  Social science
3  Malaysian-Llama-3.2-3B-Instruct  52.026865     0          Others
4  Malaysian-Llama-3.2-3B-Instruct  54.539249     0      Humanities
Model : Malaysian-Llama-3.2-3B-Instruct
Metric : full
Shot : 0
average accuracy 52.35617230413414
accuracy for STEM 51.94433074089234
accuracy for Language 50.795165394402034
accuracy for Social science 52.73200346921075
accuracy for Others 52.02686495562485
accuracy for Humanities 54.53924914675768
```

The original model, for comparison:

```
                   Model   Accuracy  shot        category
0  Llama-3.2-3B-Instruct  50.511666     0            STEM
1  Llama-3.2-3B-Instruct  49.825064     0        Language
2  Llama-3.2-3B-Instruct  48.352125     0  Social science
3  Llama-3.2-3B-Instruct  48.213001     0          Others
4  Llama-3.2-3B-Instruct  51.990899     0      Humanities
Model : Llama-3.2-3B-Instruct
Metric : full
Shot : 0
average accuracy 49.58906372609755
accuracy for STEM 50.51166598444535
accuracy for Language 49.82506361323155
accuracy for Social science 48.35212489158716
accuracy for Others 48.21300071959703
accuracy for Humanities 51.990898748577926
```
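
The exact-match variant greedily decodes a single token and compares it against the gold option letter; a minimal vLLM sketch (simplified prompt, not the official harness):

```python
# Minimal exact first-token match with vLLM: greedy-decode one token per
# question and compare it to the gold answer letter.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=1)  # greedy, single token

prompts = ["<question + options, ending with the answer cue>"]  # simplified
golds = ["A"]

outputs = llm.generate(prompts, params)
correct = sum(
    out.outputs[0].text.strip() == gold for out, gold in zip(outputs, golds)
)
print(f"accuracy: {correct / len(golds):.2%}")
```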

## Acknowledgement

Special thanks to https://www.sns.com.my and Nvidia for the 8x H100 node!