Muon vs MuonClip vs Muon+AdamW
Muon has gone from an experiment to a mainstream optimizer, but does it hold up for fine-tuning? We ran head-to-head tests on Qwen3-4B (10k+ high-quality instruction rows) to find out.
Short story: Pure Muon converged fastest at the start, but its gradient-norm spikes made training unstable. MuonClip (Kimi K2's clipping) stabilizes long pretraining runs, yet in our small-scale fine-tune it underperformed: lower token accuracy and slower convergence. The winner was the hybrid: Muon for 2D layers + AdamW for 1D layers. It delivered the best balance of stability and final performance and even beat vanilla AdamW.
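If you want to try the hybrid yourself, the core idea is just a parameter split by dimensionality. Here's a minimal sketch in PyTorch, assuming you have a Muon implementation with a standard torch.optim-style interface; the learning rates and the choice to route embeddings and the LM head to AdamW are illustrative assumptions, not the exact recipe from our runs:

```python
import torch
from torch import nn, optim

def build_hybrid_optimizers(model: nn.Module, muon_cls,
                            lr_muon: float = 2e-5, lr_adamw: float = 2e-5):
    """Split parameters: Muon for 2D weight matrices, AdamW for everything else.

    muon_cls is any Muon optimizer class with a torch.optim-style constructor
    (hypothetical placeholder -- plug in whichever implementation you use).
    """
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Muon's orthogonalized update only applies to matrices; 1D params
        # (biases, norm scales) -- and, as an assumption here, embeddings and
        # the output head -- go to AdamW instead.
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return [
        muon_cls(muon_params, lr=lr_muon),
        optim.AdamW(adamw_params, lr=lr_adamw, weight_decay=0.01),
    ]
```

In the training loop you then step both optimizers each iteration, exactly as you would a single one.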
Takeaway: for small-scale fine-tuning, hybrid = practical and reliable.
Next Step: scale to larger models/datasets to see if Muon's spikes become catastrophic or if clipping wins out.
Full Blog Link: https://huggingface.co/blog/KingNish/optimizer-part1