Is MoE fine-tuning also autoregressive? How should the gradients from different samples in a batch be accumulated? Hope to get your reply!