Do you really use flash attention?

#5 · opened by GinnM

I noticed that:

        # In the NeoBERT attention forward, where scaled_dot_product_attention is
        # torch.nn.functional.scaled_dot_product_attention:
        attn = scaled_dot_product_attention(
            query=xq.transpose(1, 2),
            key=xk.transpose(1, 2),
            value=xv.transpose(1, 2),
            attn_mask=attention_mask.bool(),
            dropout_p=0,
        ).transpose(1, 2)

But when attn_mask is not None, scaled_dot_product_attention cannot dispatch to the flash attention kernel, which only supports causal masking via is_causal; it silently falls back to the memory-efficient or math backend instead.
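
Here is a minimal sketch that shows this (assuming PyTorch >= 2.3 for torch.nn.attention.sdpa_kernel and a CUDA device with a recent GPU): when SDPA is restricted to the flash backend, passing a dense attn_mask raises "No available kernel" instead of running, while is_causal works.

    import torch
    import torch.nn.functional as F
    from torch.nn.attention import SDPBackend, sdpa_kernel

    q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    # Padding-style boolean mask, like the attention_mask NeoBERT passes in.
    mask = torch.ones(1, 1, 128, 128, device="cuda", dtype=torch.bool)

    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        try:
            # The flash kernel does not accept an arbitrary attn_mask.
            F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        except RuntimeError as e:
            print("flash backend refused attn_mask:", e)
        # The flash kernel is fine with is_causal (no explicit mask tensor).
        F.scaled_dot_product_attention(q, k, v, is_causal=True)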

I also tried loading the model with both the sdpa and flash_attention_2 attention implementations, and both fail with an error:
RuntimeError: Failed to load AutoModel for chandar-lab/NeoBERT. Error: NeoBERT does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet.
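
For concreteness, the load call that produces the error above looks roughly like this (a sketch of the sdpa case; I'm assuming trust_remote_code is required for this remote-code checkpoint, and the exact kwargs in my script may differ):

    from transformers import AutoModel

    # Roughly the failing call (sdpa case); exact kwargs may differ.
    model = AutoModel.from_pretrained(
        "chandar-lab/NeoBERT",
        trust_remote_code=True,
        attn_implementation="sdpa",
    )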
