Going deeper vs wider
Hi again! How did you arrive at the specific configuration for num_hidden_layers, hidden_size, and intermediate_size?
Given a fixed parameter (or latency) budget, do you have any insights into what helps model quality more: adding more layers (at the cost of smaller hidden/intermediate sizes), or the reverse?
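To make the fixed-budget framing concrete, here is the kind of back-of-the-envelope comparison I have in mind. This is only a minimal sketch assuming a standard dense attention + gated-MLP block with biases and embeddings ignored, and the two configs below are made up for illustration, not your actual ones:

```python
# Rough non-embedding parameter count for a standard dense transformer
# block (attention + gated MLP, biases ignored). Purely illustrative;
# the exact architecture discussed here may differ (e.g. MoE layers).

def block_params(hidden_size: int, intermediate_size: int) -> int:
    attn = 4 * hidden_size * hidden_size       # Q, K, V, O projections
    mlp = 3 * hidden_size * intermediate_size  # gate, up, down projections
    return attn + mlp

def model_params(num_hidden_layers: int, hidden_size: int, intermediate_size: int) -> int:
    return num_hidden_layers * block_params(hidden_size, intermediate_size)

# Two hypothetical configs landing on roughly the same budget:
deep_narrow = model_params(num_hidden_layers=32, hidden_size=2048, intermediate_size=5632)
shallow_wide = model_params(num_hidden_layers=16, hidden_size=2880, intermediate_size=8064)

print(f"deep/narrow:  {deep_narrow / 1e9:.2f}B params")   # ~1.64B
print(f"shallow/wide: {shallow_wide / 1e9:.2f}B params")  # ~1.65B
```

Many (depth, width) pairs hit the same budget like this, so I'm curious how you picked yours.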
Thanks!
Width vs. depth does not matter too much (see https://arxiv.org/abs/2001.08361), so we just went with values from prior work
(except for the finegrained experts part which we ablate in the paper)
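If I remember their fits correctly, the loss depends mainly on the total non-embedding parameter count N, roughly L(N) ≈ (N_c / N)^{α_N} with α_N ≈ 0.076, and at fixed N the depth/width aspect ratio only moves the loss by a few percent over a fairly wide range.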
Thanks for the pointer! It would be interesting to revisit width vs. depth at larger token counts, since a lot of trends tend to emerge clearly only after ~30-40B tokens (as seen even in the OLMoE paper).
If I read it correctly, https://arxiv.org/abs/2001.08361 trains models on up to ~20B tokens, which might be too early for these trends to show.
Oh good point; yes maybe their scales were too small!