Malaysian Qwen 2.5 7B Instruct Dialect Reasoning GRPO
Online Reinforcement learning using GRPO full parameter on warmup reasoning SFT https://huggingface.co/mesolitica/Malaysian-Qwen2.5-7B-Reasoning-SFT on highly curated Malay Dialect Reasoning dataset.
Improvement
- Improve reasoning on Dialects, each datapoint been replicated to 6 generations.
- Actual online reinforcement learning.
Better performance
To get better performance, use system prompt You are going to enter reasoning mode. First, you try to think step-by-step in Malay. After that, put your final answer within $\\boxed{}$.
, you can check how we trained it at https://github.com/mesolitica/malaya/blob/master/session/qwen2.5/grpo.py#L80
Training session
Finetune on huseinzol05/malaysian-dialect-qa, this is train set from mesolitica/Malay-Dialect-Reasoning.
How we train
- GRPO full parameters.
- WanDB at https://wandb.ai/huseinzol05/fpf-Malaysian-Qwen2.5-7B-Reasoning-SFT-GRPO-v3
Checkpoints
- Epoch 1.1, revision 78435e1edc593a842e6031ba6ee7a5930d9d2a83
- Epoch 2.0, revision 25d418d0032f08c39a506b529e6133e60f998a61
- Epoch 2.96, revision 4c6886a43f73767be61d67093f20dbdf1a7d8df6
Source code
Source code at https://github.com/mesolitica/malaya/blob/master/session/qwen2.5/7b-grpo-fsdp.sh
Benchmark
All the benchmarks generate using vLLM, evaluation based on sacrebleu CHRF max@5.
Source code for evaluation at https://github.com/mesolitica/malaya/tree/master/session/qwen2.5/evaluate-dialect
Float32
Dialect to standard Malay,
From: johor To: malay, score: 58.2189619529139
From: kedah To: malay, score: 59.21260384746205
From: pahang To: malay, score: 53.506270589822165
From: negeri sembilan To: malay, score: 56.94870448682657
From: kelantan To: malay, score: 50.64768195652429
From: penang To: malay, score: 62.964413639258034
From: melaka To: malay, score: 56.24541676643081
average: 56.82057903417684
Standard Malay to dialect,
From: malay To: johor, score: 54.83246740931249
From: malay To: kedah, score: 59.069394967356274
From: malay To: pahang, score: 59.695207458023745
From: malay To: negeri sembilan, score: 50.69885056697714
From: malay To: kelantan, score: 44.66310165425512
From: malay To: penang, score: 65.39795752468879
From: malay To: melaka, score: 72.39183991789344
average: 58.10697421407243
Float16
Dialect to standard Malay,
From: johor To: malay, score: 57.42949426937456
From: kedah To: malay, score: 58.12580212528728
From: pahang To: malay, score: 55.60484906845884
From: negeri sembilan To: malay, score: 56.4509629484568
From: kelantan To: malay, score: 53.944979416369996
From: penang To: malay, score: 62.20935643642939
From: melaka To: malay, score: 57.14492955494046
average: 57.27291054561676
Standard Malay to dialect,
From: malay To: johor, score: 55.68356840259747
From: malay To: kedah, score: 56.264707994950186
From: malay To: pahang, score: 60.15982036912563
From: malay To: negeri sembilan, score: 48.71725827604103
From: malay To: kelantan, score: 43.948995049469474
From: malay To: penang, score: 63.15864675162173
From: malay To: melaka, score: 74.12398375006538
average: 57.436711513410124
Special thanks
Special thanks to https://www.sns.com.my and Nvidia for 8x H100 node!
- Downloads last month
- 2
Model tree for mesolitica/Malaysian-Qwen2.5-7B-Dialect-Reasoning-GRPO
Base model
mesolitica/Malaysian-Qwen2.5-7B-Instruct