MoE Experiments (proper sparse MoEs)
Based on SmolLM2 (a Llama-architecture model), MoE-ified and then further trained on a general dataset. A sketch of the routed MoE block is included below the config.
MoE layers: [8, 12, 16, 20, 24, 28]
Top-k: 2 (activates 50.0% of experts per token)
Hidden size: 960
Total parameters: 494,554,560
Trainable parameters: 494,554,560
Auxiliary loss weight: 0.01
code @ https://gist.github.com/cappuch/6a454ec8d2d349a27f9fd84f6ac90554
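A minimal sketch of what each routed layer could look like, assuming 4 experts per MoE layer (consistent with top-k 2 activating 50% of experts), a Llama-style gated FFN as the expert, and a Switch-style load-balancing auxiliary loss. Class and parameter names (`SparseMoE`, `LlamaMLP`, `intermediate_size=2560`) are illustrative and not taken from the gist above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LlamaMLP(nn.Module):
    """Llama-style gated FFN used here as a single expert (sizes assumed)."""
    def __init__(self, hidden_size=960, intermediate_size=2560):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class SparseMoE(nn.Module):
    """Top-k routed mixture of experts with an auxiliary load-balancing loss."""
    def __init__(self, hidden_size=960, num_experts=4, top_k=2, aux_loss_weight=0.01):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.aux_loss_weight = aux_loss_weight
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(LlamaMLP(hidden_size) for _ in range(num_experts))

    def forward(self, x):
        batch, seq, hidden = x.shape
        flat = x.reshape(-1, hidden)                      # (tokens, hidden)
        logits = self.router(flat)                        # (tokens, experts)
        probs = logits.softmax(dim=-1)
        top_p, top_i = probs.topk(self.top_k, dim=-1)     # keep the k best experts per token
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalize kept routing weights

        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (top_i == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += top_p[token_idx, slot].unsqueeze(-1) * expert(flat[token_idx])

        # Switch-style auxiliary loss: fraction of routing slots per expert
        # times that expert's mean router probability, scaled by num_experts.
        frac_tokens = F.one_hot(top_i, self.num_experts).float().mean(dim=(0, 1))
        mean_probs = probs.mean(dim=0)
        aux_loss = self.aux_loss_weight * self.num_experts * (frac_tokens * mean_probs).sum()

        return out.reshape(batch, seq, hidden), aux_loss
```

In the model described above, the dense MLP at layers 8, 12, 16, 20, 24, 28 would be swapped for a block like this (the remaining layers keep their original FFN), with the per-layer auxiliary losses summed and added to the LM loss at weight 0.01.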