Long context: YaRN max_position_embeddings 32k or 40k?

#10
by stev236 - opened

Thanks for the nice model, but unfortunately I don't get results as good with Qwen3 using YaRN (4x) as I did with Qwen2.5-1M (due to instruction-following failures). Is that expected?

To me, the long-context performance of the 1M model is what made Qwen2.5 superior to others. Will the Qwen team also be releasing a Qwen3-1M model at a later date?

Regarding max_position_embeddings, the config.json file shows it as 40960 (I presume 8k was added for reasoning tokens), while the model card suggests enabling YaRN with a setting that is 4x 32768. Is that an error? Could we not extend it to 4x 40960?

"rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
}
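
To make the question concrete, this is the kind of setting I'm asking about (purely illustrative; I don't know whether scaling from 40960 is actually valid or supported):

"rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 40960
}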

Thanks for any help.

> I presume 8k was added for reasoning tokens

I think it's the opposite: 32k is allocated for reasoning and 8k is allocated for the actual answer 😅

I'm also curious about the original YaRN context length. It would be nice to get some input from a Qwen team member here.

> I think it's the opposite, 32k is allocated for reasoning and 8k is allocated for the actual answer

Never mind, now that I've thought about it I think you're right :) since the reasoning is supposed to be removed after every turn.

As for original_max_position_embeddings, I'm currently assuming that it's natively 32k and using YaRN with a factor of 2.0 to get 65536 context, and it's working pretty well. I don't notice any degradation.
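
Concretely, this is the override I'm running with (assuming the native length really is 32768, so 2 × 32768 = 65536):

"rope_scaling": {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 32768
}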
