Broken RoPE implementation?
`apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1)` in modeling_deepseek.py has a different implementation than the Unsloth, Liger Kernel, and HF Transformers RoPE implementations, and it gives different results.
Is this intended?
Why does it have this block of code?
```python
b, h, s, d = q.shape
q = q.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

b, h, s, d = k.shape
k = k.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
```
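For context, here is a small standalone sketch (my own, not code from the DeepSeek repo) showing what that view/transpose/reshape actually does: it de-interleaves the last dimension, moving even-indexed elements to the first half and odd-indexed elements to the second half.

```python
import torch

def reorder(x):
    # Same view/transpose/reshape as the block quoted above.
    b, h, s, d = x.shape
    return x.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

x = torch.arange(8.0).view(1, 1, 1, 8)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(reorder(x))                       # [0, 2, 4, 6, 1, 3, 5, 7]
```

In other words, it converts a pair-interleaved layout into the half-split layout that `rotate_half` expects.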
For reference, here is the latest HF Transformers implementation of `apply_rotary_pos_emb`:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L151
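For readers who don't want to click through, the linked Llama version looks roughly like the sketch below (paraphrased, so check the link for the exact current code):

```python
import torch

def rotate_half(x):
    # Split the head dimension in half and rotate: [x1, x2] -> [-x2, x1].
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    # cos/sin are broadcast across the head axis via unsqueeze.
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

Note there is no reshaping of q and k before the rotation, which is the difference in question.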
Thanks to the DeepSeek team for all your hard work. I am just worried this could be negatively impacting the model's output.
FWIW, the code block does seem intentional. Maybe related to how DeepSeek lays out its values for MLA or something. An explanation would be appreciated, but I'll re-read the paper too.
For reference, here is the output of each implementation:

- row 1: DeepSeek V3 RoPE implementation
- row 2: HF RoPE implementation
- row 3: Liger Kernel RoPE implementation
- row 4: Unsloth RoPE implementation

All implementations match except DeepSeek's, mainly because of that transpose block.
EDIT: It seems they use the same implementation for V2: https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/modeling_deepseek.py#L339
This calculation only changes the positions of the elements within q and k; the values themselves are unchanged, and the elements of q and k still correspond one-to-one. Because the same reordering is applied to both q and k, it cancels when you compute q @ k^T, so the result is consistent with the HF RoPE implementation.
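To make that argument concrete, here is a small sketch (my own, not from the DeepSeek repo) checking the key step: the transpose block is a pure reordering of the head dimension, and a reordering applied identically to q and k does not change q @ k^T.

```python
import torch

def deinterleave(x):
    # The block from modeling_deepseek.py: permutes the last dimension,
    # moving even indices to the first half and odd indices to the second.
    b, h, s, d = x.shape
    return x.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

b, h, s, d = 1, 2, 5, 8
q = torch.randn(b, h, s, d)
k = torch.randn(b, h, s, d)

# 1) It is only a reordering: sorting each head vector gives identical values.
assert torch.equal(deinterleave(q).sort(-1).values, q.sort(-1).values)

# 2) The same reordering applied to both q and k cancels in the dot product.
scores_plain = q @ k.transpose(-1, -2)
scores_reordered = deinterleave(q) @ deinterleave(k).transpose(-1, -2)
assert torch.allclose(scores_plain, scores_reordered, atol=1e-6)
print("q @ k^T is unchanged by the shared reordering")
```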