Why, inside `modeling_phi.py`, does the output of Self Attention not become the input of the MLP?

#94
by fahadh4ilyas - opened

Usually, the hidden_states output from self_attn becomes the input to mlp. But in modeling_phi.py, it seems that the hidden_states after the input layer norm are fed into both self_attn and mlp, and the two outputs are then added together at the end. What kind of transformer implementation is that?
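
Roughly, this is the pattern I mean (a simplified sketch, not the actual modeling_phi.py code; self_attn, mlp, and input_layernorm are placeholders for the real modules):

```python
import torch.nn as nn

class ParallelBlock(nn.Module):
    # Simplified sketch of the layer structure described above.
    def __init__(self, hidden_size, self_attn, mlp):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)
        self.self_attn = self_attn
        self.mlp = mlp

    def forward(self, hidden_states):
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        # The *same* normalized hidden_states feed both branches...
        attn_out = self.self_attn(hidden_states)
        mlp_out = self.mlp(hidden_states)
        # ...and the two branch outputs are summed with the residual at the end.
        return residual + attn_out + mlp_out
```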

Microsoft org

Hello @fahadh4ilyas !

Attention can also be applied in parallel with the MLP, instead of sequentially as in, e.g., GPT-2 or Llama.

Please check GPT-J/CodeGen's implementation: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gptj/modeling_gptj.py#L311.
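
For contrast, in the sequential layout the MLP consumes the attention branch's output and there are two residual additions. A minimal sketch (illustrative module names, not the exact GPT-2/Llama code):

```python
import torch.nn as nn

class SequentialBlock(nn.Module):
    # Sequential (GPT-2/Llama-style) layout: attention first, then the MLP
    # operates on the output of the attention residual branch.
    def __init__(self, hidden_size, self_attn, mlp):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)
        self.post_attention_layernorm = nn.LayerNorm(hidden_size)
        self.self_attn = self_attn
        self.mlp = mlp

    def forward(self, hidden_states):
        residual = hidden_states
        hidden_states = self.self_attn(self.input_layernorm(hidden_states))
        hidden_states = residual + hidden_states  # first residual add

        residual = hidden_states
        hidden_states = self.mlp(self.post_attention_layernorm(hidden_states))
        return residual + hidden_states  # second residual add
```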

gugarosa changed discussion status to closed
