Experimental MoE with 3 experts totalling ~480M params. The router is roughly 70M params.
No loss chart for this router; it was trained on 15 samples.
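For readers unfamiliar with how a router picks among experts, here is a minimal sketch of top-1 routing over 3 experts. The routing strategy is an assumption: the description does not say whether this model uses top-1, top-k, or something else. The names `route_top1`, `moe_forward`, and the toy scalar experts and router are hypothetical stand-ins, not the actual model components.

```python
import math

NUM_EXPERTS = 3  # matches the 3 experts mentioned in the description


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(router_logits):
    """Pick the highest-scoring expert for one input.

    Returns (expert_index, gate_weight). Top-1 routing is an
    assumption; the real router may combine several experts.
    """
    if len(router_logits) != NUM_EXPERTS:
        raise ValueError(f"expected {NUM_EXPERTS} logits, got {len(router_logits)}")
    probs = softmax(router_logits)
    idx = max(range(NUM_EXPERTS), key=lambda i: probs[i])
    return idx, probs[idx]


def moe_forward(x, experts, router):
    """Send x to the single selected expert and scale by its gate weight."""
    idx, gate = route_top1(router(x))
    return gate * experts[idx](x)


# Hypothetical toy components: scalar functions in place of real networks.
toy_experts = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: -x]
toy_router = lambda x: [x, 0.0, -x]  # one logit per expert

output = moe_forward(2.0, toy_experts, toy_router)
```

In a real MoE the router and experts would be neural networks and routing would typically happen per token. The sketch only shows the selection step.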