How about a hyper-fragmented sparse MoE?
Thanks for open-sourcing such a large-scale MoE model! Kimi K2's performance is super impressive!
I've been wondering lately: what if we made the experts in an MoE even smaller and more fragmented? We've run some small experiments on our end and found that combining strategies like "gradient-minimization selective updates" seems to naturally force these tiny experts to functionally differentiate, each taking on a specific role; shared experts even emerge on their own. Combined with dynamic top-k routing...
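To make the idea concrete, here is a minimal NumPy sketch of dynamic top-k routing over many tiny experts. All of it is illustrative: the shapes, the probability threshold, the renormalization of the kept gates, and the two-layer ReLU experts are my own assumptions for the sketch, not anything from K2 or our actual experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_expert, n_experts = 16, 8, 32   # many tiny ("fragmented") experts
W1 = rng.standard_normal((n_experts, d_model, d_expert)) * 0.1
W2 = rng.standard_normal((n_experts, d_expert, d_model)) * 0.1
W_router = rng.standard_normal((d_model, n_experts)) * 0.1

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, p_threshold=0.05, k_max=8):
    """Dynamic top-k routing: each token activates every expert whose
    router probability exceeds p_threshold, capped at k_max experts
    (and at least one), with the kept gate weights renormalized."""
    probs = softmax(x @ W_router)               # (n_tokens, n_experts)
    out = np.zeros_like(x)
    k_used = []                                 # how many experts each token used
    for i, (tok, p) in enumerate(zip(x, probs)):
        top = np.argsort(p)[::-1][:k_max]       # candidate experts by gate prob
        chosen = [e for e in top if p[e] > p_threshold] or [top[0]]
        w = p[chosen] / p[chosen].sum()         # renormalized gates
        for e, we in zip(chosen, w):
            h = np.maximum(tok @ W1[e], 0.0)    # tiny two-layer ReLU expert
            out[i] += we * (h @ W2[e])
        k_used.append(len(chosen))
    return out, k_used

tokens = rng.standard_normal((4, d_model))
y, k_used = moe_forward(tokens)
```

The point is that `k_used` varies per token: easy tokens fall below the threshold for most experts and activate few, while ambiguous tokens spread over more of the tiny experts.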
Feels like this could push sparsity to a new level, making models even more efficient and flexible. Wondering if you guys have considered this direction?
There is existing work showing that fine-grained MoEs perform better than coarse-grained ones, and they have fascinating properties. We definitely considered this direction, but it is not the main focus of K2.