I may be missing an obvious answer to this, but I'm curious.

#20
by drmcbride

I understand the massive speed increase of using MoE, since we only activate 32B params per token, but why not just train a dense 1T-param model and win on all benchmarks? What gets in the way of that? I'm sure there's a correct answer I'm missing; I just want to know what it is.

Moonshot AI org

Given a fixed training budget, MoE models consistently outperform dense ones. A quick way to estimate LLM training FLOPs is 6ND (N = active parameters, D = tokens). A 32B-active/1T-total MoE therefore uses far fewer FLOPs than a 1T dense model. With more training budget, we may simply scale the MoE further. Note that the larger total parameter count leads to higher communication cost, so it is not totally free. We will discuss this trade-off in the K2 tech report.
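
To make the 6ND comparison concrete, here is a minimal sketch of the arithmetic; the 15T token budget below is a hypothetical figure for illustration only, not K2's actual training data size:

```python
# Rough training-FLOPs comparison using the 6*N*D rule of thumb,
# where N = parameters active per token and D = training tokens.

ACTIVE_PARAMS_MOE = 32e9   # 32B active params per token (MoE)
PARAMS_DENSE = 1e12        # 1T params, all active per token (dense)
TOKENS = 15e12             # assumed token budget (hypothetical)

flops_moe = 6 * ACTIVE_PARAMS_MOE * TOKENS
flops_dense = 6 * PARAMS_DENSE * TOKENS

print(f"MoE   (32B active): {flops_moe:.2e} FLOPs")
print(f"Dense (1T active):  {flops_dense:.2e} FLOPs")
print(f"Dense needs ~{flops_dense / flops_moe:.0f}x the compute for the same tokens")
```

Under this estimate the dense 1T model costs roughly 31x the training compute of the 32B-active MoE for the same number of tokens, which is why the fixed-budget comparison favors MoE.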

For deeper dives on MoE, I recommend two excellent papers:
https://arxiv.org/pdf/2202.08906
https://arxiv.org/abs/2401.06066
