Broken RoPE implementation?
`apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1)` in modeling_deepseek.py has a different implementation than the Unsloth, Liger Kernel, and HF Transformers RoPE implementations, and it gives different results.
Is this intended?
Why does it have this block of code?
```python
b, h, s, d = q.shape
q = q.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

b, h, s, d = k.shape
k = k.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
```
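For context, here is a small standalone sketch (my own, not code from the DeepSeek repo) showing what that view/transpose/reshape actually does: it de-interleaves the last dimension, moving even-indexed elements to the first half and odd-indexed elements to the second half.

```python
import torch

def reorder(x):
    # Same view/transpose/reshape as the block quoted above.
    b, h, s, d = x.shape
    return x.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

x = torch.arange(8.0).view(1, 1, 1, 8)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(reorder(x))                       # [0, 2, 4, 6, 1, 3, 5, 7]
```

In other words, it converts a pair-interleaved layout into the half-split layout that `rotate_half` expects.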
For reference, here is the latest HF Transformers implementation of `apply_rotary_pos_emb`:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L151
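For readers who don't want to click through, the linked Llama version looks roughly like the sketch below (paraphrased, so check the link for the exact current code):

```python
import torch

def rotate_half(x):
    # Split the head dimension in half and rotate: [x1, x2] -> [-x2, x1].
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    # cos/sin are broadcast across the head axis via unsqueeze.
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

Note there is no reshaping of q and k before the rotation, which is the difference in question.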
Thanks to the DeepSeek team for all your hard work. I am just worried this could be negatively impacting the model's output.
FWIW, the code block does seem intentional. Maybe related to how DeepSeek lays out its values for MLA or something. An explanation would be appreciated, but I'll re-read the paper too.
For reference, here is the output of each implementation:

- row 1: DeepSeek V3 RoPE implementation
- row 2: HF RoPE implementation
- row 3: Liger Kernel RoPE implementation
- row 4: Unsloth RoPE implementation

All implementations match except DeepSeek's, mainly because of that transpose block.
EDIT: It seems they use the same implementation for V2: https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/modeling_deepseek.py#L339
This calculation only changes the positions of the elements within q and k; the values themselves are unchanged, and the elements of q and k still correspond one-to-one. Because the same reordering is applied to both q and k, it cancels when you compute q @ k^T, so the result is consistent with the HF RoPE implementation.
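To make that argument concrete, here is a small sketch (my own, not from the DeepSeek repo) checking the key step: the transpose block is a pure reordering of the head dimension, and a reordering applied identically to q and k does not change q @ k^T.

```python
import torch

def deinterleave(x):
    # The block from modeling_deepseek.py: permutes the last dimension,
    # moving even indices to the first half and odd indices to the second.
    b, h, s, d = x.shape
    return x.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

b, h, s, d = 1, 2, 5, 8
q = torch.randn(b, h, s, d)
k = torch.randn(b, h, s, d)

# 1) It is only a reordering: sorting each head vector gives identical values.
assert torch.equal(deinterleave(q).sort(-1).values, q.sort(-1).values)

# 2) The same reordering applied to both q and k cancels in the dot product.
scores_plain = q @ k.transpose(-1, -2)
scores_reordered = deinterleave(q) @ deinterleave(k).transpose(-1, -2)
assert torch.allclose(scores_plain, scores_reordered, atol=1e-6)
print("q @ k^T is unchanged by the shared reordering")
```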