Broken RoPE implementation?

#83
by rickyd66 - opened

apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1) in modeling_deepseek.py

has a different implementation than the Unsloth, Liger Kernel, and HF Transformers RoPE implementations, and it gives different results.
Is this intended?

Why does it have this block of code?

    b, h, s, d = q.shape
    q = q.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

    b, h, s, d = k.shape
    k = k.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
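
Isolating that block on a tiny tensor shows what it does (a quick sketch, not from the repo): it moves the even-indexed elements of the last dimension into the first half and the odd-indexed ones into the second, i.e. it de-interleaves adjacent pairs:

    import torch

    # Run just the questioned reshape on a tiny tensor to see how it
    # permutes the last dimension.
    b, h, s, d = 1, 1, 1, 6
    q = torch.arange(d).view(b, h, s, d)  # last dim: [0, 1, 2, 3, 4, 5]
    q = q.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
    print(q.flatten().tolist())           # [0, 2, 4, 1, 3, 5]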

For reference, here is the latest HF transformer implementation of apply_rotary_pos_emb:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L151

Thanks DeepSeek team for all your hard work. I am just worried this could be negatively impacting the model's output.

FWIW, the code block does seem intentional. Maybe it's related to how DeepSeek lays out its values for MLA or something. An explanation would be appreciated, but I'll re-read the paper too.

[image: printed tensors from the four implementations, rows as listed below]

For reference, here is the output of
row 1: DeepSeek v3 RoPE implementation
row 2: HF RoPE implementation
row 3: Liger Kernel RoPE implementation
row 4: Unsloth RoPE implementation

All implementations match except DeepSeek's, mainly because of that transpose block.

EDIT:
Seems like they use the same implementation for v2: https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/modeling_deepseek.py#L339

rickyd66 changed discussion title from Broken RoPE Embedding? to Broken RoPE implementation?

[image]

Possibly it's done like this so that RoPE is only applied to half of the dimensions. Interesting.

EDIT: Never mind, that doesn't make sense.

[image]

Seems to make sense. Not sure why no other RoPE implementation does this, though. Maybe the values are already in the right format there.
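
One plausible reason the other implementations skip the permute (this is my assumption, not something from any of these repos): the checkpoints that rotate_half-style kernels target typically have the rotary pairs de-interleaved once at weight-conversion time, so the half-split layout is already baked into the q/k projections. A sketch of such a one-off conversion, where deinterleave_rows is a hypothetical helper:

    import torch

    # Hypothetical one-off weight conversion (my assumption): de-interleave
    # the output rows of the q/k projection per head, so a plain
    # rotate_half RoPE is correct at runtime with no per-step permutation.
    def deinterleave_rows(w: torch.Tensor, n_heads: int) -> torch.Tensor:
        out_dim, in_dim = w.shape
        head_dim = out_dim // n_heads
        w = w.view(n_heads, head_dim // 2, 2, in_dim)
        # (heads, head_dim//2, 2, in) -> (heads, 2, head_dim//2, in):
        # even output rows first, then odd, within each head.
        return w.transpose(1, 2).reshape(out_dim, in_dim)

Applying the same row permutation to both the q and k projections is exactly the invariance the runtime permute relies on, so the attention scores are unaffected.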

rickyd66 changed discussion status to closed

This calculation only changes the positions of the elements within q and k; the values themselves are unchanged, and the elements of q and k still correspond one-to-one. Since the exact same permutation is applied to both q and k, it cancels out in the dot product, so after computing q * k^T the result is consistent with the HF RoPE implementation.
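
A quick numerical check of that invariance (a sketch, not DeepSeek's actual code): the permute-then-rotate output is a fixed permutation of a plain interleaved (RoFormer/GPT-J-style) rotation, and because q and k are permuted identically, the permutation cancels in q * k^T:

    import torch

    def rotate_half(x):
        # Standard half-split rotation helper, as in HF Transformers.
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    def rope_deepseek(x, cos, sin):
        # De-interleave, then apply the half-split rotation.
        b, h, s, d = x.shape
        x = x.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
        return x * cos + rotate_half(x) * sin

    def rope_interleaved(x, cos_half, sin_half):
        # Original RoFormer convention: rotate adjacent pairs in place.
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos_half - x2 * sin_half
        out[..., 1::2] = x1 * sin_half + x2 * cos_half
        return out

    b, h, s, d = 1, 2, 5, 8
    q, k = torch.randn(b, h, s, d), torch.randn(b, h, s, d)
    theta = 1.0 / (10000.0 ** (torch.arange(0, d, 2).float() / d))
    ang = torch.arange(s).float()[:, None] * theta[None, :]  # (s, d/2)
    cos = torch.cat([ang.cos()] * 2, dim=-1)                 # (s, d)
    sin = torch.cat([ang.sin()] * 2, dim=-1)

    q_ds, k_ds = rope_deepseek(q, cos, sin), rope_deepseek(k, cos, sin)
    q_il = rope_interleaved(q, ang.cos(), ang.sin())
    k_il = rope_interleaved(k, ang.cos(), ang.sin())

    print(torch.allclose(q_ds, q_il))  # False: element-wise they differ
    print(torch.allclose(q_ds @ k_ds.transpose(-1, -2),
                         q_il @ k_il.transpose(-1, -2),
                         atol=1e-5))   # True: attention scores match

So the extra permute is a layout conversion, not a bug; matching the HF implementation element-for-element would additionally require the q/k weights to be stored in the corresponding de-interleaved layout.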
