Why do we need to hardcode self._attn_implementation = "eager"?
#35
opened by shantanuagarwal
Thanks a lot for making the code public. Looking into the modeling_nvembed.py file, I notice two things (a paraphrased sketch of the code I mean is below):

1. `layer.self_attn.is_causal = False`. This makes sense, since we want to enforce bi-directionality.
2. However, what I don't understand is why the attention implementation has to be forced to `eager`. Does that mean `sdpa`/`flash_attention_2` are not supported?
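For reference, this is roughly the pattern I'm referring to. It's my own paraphrase of what I see in modeling_nvembed.py, so the class body and exact placement are simplified, not the actual file contents:

```python
# Paraphrased sketch of the lines in modeling_nvembed.py I'm asking about
# (simplified; the real file does more than this).
from transformers import MistralConfig, MistralModel

class BidirectionalMistralModel(MistralModel):
    def __init__(self, config: MistralConfig):
        super().__init__(config)
        for layer in self.layers:
            # Disable causal masking per layer -> bidirectional attention.
            layer.self_attn.is_causal = False
        # This is the part I don't understand: why force eager attention?
        self._attn_implementation = "eager"
```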
What I am trying to understand is: what would need to change in `BidirectionalMistralModel`'s `forward` to make it compatible with `sdpa`/`flash_attention_2`?
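For context on why I'd expect this to be possible: as far as I can tell, PyTorch's SDPA kernel itself handles bidirectional attention fine if you pass `is_causal=False` together with a non-causal mask. A minimal standalone sketch (the shapes and the all-ones mask here are just illustrative assumptions, not taken from the repo):

```python
# Minimal sketch: PyTorch SDPA with bidirectional (non-causal) attention.
import torch
import torch.nn.functional as F

batch, heads, seq, dim = 2, 8, 16, 64
q = torch.randn(batch, heads, seq, dim)
k = torch.randn(batch, heads, seq, dim)
v = torch.randn(batch, heads, seq, dim)

# Boolean mask, True = attend. Here every token attends to every token;
# a real padding mask would set pad positions to False instead.
bidir_mask = torch.ones(batch, 1, seq, seq, dtype=torch.bool)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=bidir_mask, is_causal=False)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```

So my question is really about what in the current `forward` (mask construction, `_attn_implementation` handling, etc.) prevents the `sdpa`/`flash_attention_2` paths from being used.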