I may be missing an obvious answer to this, but I'm curious.

#20
by drmcbride

I understand the massive speed increase of using MoE, since we only activate 32B params per token, but why not just train a dense 1T-param model and win on all benchmarks? What gets in the way of that? I'm sure there's a correct answer I'm missing; I just want to know what it is.

Moonshot AI org

Given a fixed training budget, MoE models consistently outperform dense ones. A quick way to estimate LLM training FLOPs is 6ND (N = active parameters, D = tokens). A 32B-active/1T-total MoE therefore uses far fewer FLOPs than a 1T dense model. With more training budget, we may simply scale the MoE further. Note that the larger total parameter count leads to higher communication cost, so it is not totally free. We will discuss this trade-off in the K2 tech report.
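
To make the 6ND comparison concrete, here is a minimal sketch of the arithmetic; the 15T token budget below is a hypothetical figure for illustration only, not K2's actual training data size:

```python
# Rough training-FLOPs comparison using the 6*N*D rule of thumb,
# where N = parameters active per token and D = training tokens.

ACTIVE_PARAMS_MOE = 32e9   # 32B active params per token (MoE)
PARAMS_DENSE = 1e12        # 1T params, all active per token (dense)
TOKENS = 15e12             # assumed token budget (hypothetical)

flops_moe = 6 * ACTIVE_PARAMS_MOE * TOKENS
flops_dense = 6 * PARAMS_DENSE * TOKENS

print(f"MoE   (32B active): {flops_moe:.2e} FLOPs")
print(f"Dense (1T active):  {flops_dense:.2e} FLOPs")
print(f"Dense needs ~{flops_dense / flops_moe:.0f}x the compute for the same tokens")
```

Under this estimate the dense 1T model costs roughly 31x the training compute of the 32B-active MoE for the same number of tokens, which is why the fixed-budget comparison favors MoE.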

For deeper dives on MoE, I recommend two excellent papers:
https://arxiv.org/pdf/2202.08906
https://arxiv.org/abs/2401.06066
