Question on the "Summarizing it all" figure
#111, opened by EPFL-MLO
Thank you so much for this amazing post!
I'm trying to understand the "Summarizing it all" figure, but I can't work out how the transition from the SP domain to the TP domain happens after the Router block.
At the end of the SP region, the activations have their sequence dimension divided by TP and a full hidden dimension. After the first "Feed Forward Expert i" block, why is the sequence length no longer divided by TP? Is an all-gather that should happen before the Feed Forward missing from the figure?
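To make the shapes I'm asking about concrete, here is a minimal sketch in plain Python. It assumes the usual Megatron-style convention, where an all-gather along the sequence dimension is used when entering a TP region and a reduce-scatter when leaving it. Whether the figure follows that convention is exactly my question. The function names and the sizes below are hypothetical, chosen only for illustration.

```python
def sp_shard_shape(batch, seq, hidden, tp):
    """Per-rank activation shape inside the SP region: sequence is split across tp ranks."""
    assert seq % tp == 0, "sequence length must be divisible by the TP degree"
    return (batch, seq // tp, hidden)


def all_gather_seq(shard_shape, tp):
    """Per-rank shape after an all-gather along the sequence dimension (SP -> TP)."""
    batch, seq_shard, hidden = shard_shape
    return (batch, seq_shard * tp, hidden)


def reduce_scatter_seq(full_shape, tp):
    """Per-rank shape after a reduce-scatter along the sequence dimension (TP -> SP)."""
    batch, seq, hidden = full_shape
    assert seq % tp == 0, "sequence length must be divisible by the TP degree"
    return (batch, seq // tp, hidden)


# Hypothetical sizes, for illustration only.
batch, seq, hidden, tp = 2, 8, 16, 4

in_sp = sp_shard_shape(batch, seq, hidden, tp)  # (batch, seq / tp, hidden)
in_tp = all_gather_seq(in_sp, tp)               # (batch, seq, hidden)
back_in_sp = reduce_scatter_seq(in_tp, tp)      # (batch, seq / tp, hidden)

print(in_sp, in_tp, back_in_sp)
```

If the figure goes from the first shape to the second without any collective, that is the step I suspect is missing.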