Any reason to not use this model over the 256K context model?
If I understand correctly, the only difference between the 1-million-context upload and the regular upload is that the 1M upload has a YaRN scaling factor set to 4.
This allows it to handle 1 million tokens even though it was not trained to do so. My question is: is there any reason not to just use this model and ignore the 256K one, even if I am not going to be using the full 1 million context?
If both models have the same performance but one can handle one million tokens, I might as well just download that one.
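For concreteness, here is a minimal sketch of what I understand that difference to look like in a transformers-style config.json, written out as Python dicts (all values are illustrative guesses, not copied from the actual uploads):

```python
# Standard upload: no "rope_scaling" entry, so the context window is
# whatever the model was trained for (assumed 256K here).
standard_config = {
    "max_position_embeddings": 262144,
}

# 1M upload: identical except for a static YaRN block with factor 4,
# stretching RoPE so 262144 * 4 = 1048576 (~1M) positions fit.
long_context_config = {
    "max_position_embeddings": 1048576,
    "rope_scaling": {
        "type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 262144,
    },
}
```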
https://huggingface.co/Qwen/Qwen2.5-7B-Instruct#processing-long-texts
It says static YaRN can degrade performance on shorter inputs. Not sure whether that applies to this case, though.
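As I read that section, "static" means the scaling factor is applied to every input, long or short, which is why short prompts can suffer. A sketch of the workaround the README suggests (enable YaRN only when you actually need the extra length), using a hypothetical helper and assuming the factor-4 setup from the question:

```python
from transformers import AutoConfig, AutoModelForCausalLM

def load_with_optional_yarn(model_id: str, max_input_tokens: int):
    """Hypothetical helper: enable static YaRN only when the workload
    exceeds the model's native context window."""
    config = AutoConfig.from_pretrained(model_id)
    native_ctx = config.max_position_embeddings
    if max_input_tokens > native_ctx:
        # Once set, static YaRN scales RoPE for ALL inputs, including
        # short ones (the trade-off the Qwen README warns about).
        config.rope_scaling = {
            "type": "yarn",
            "factor": 4.0,  # assumption: the factor the 1M upload uses
            "original_max_position_embeddings": native_ctx,
        }
    return AutoModelForCausalLM.from_pretrained(model_id, config=config)

# e.g. a short-context workload keeps the vanilla config:
# model = load_with_optional_yarn("Qwen/Qwen2.5-7B-Instruct", 8192)
```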
The biggest reason the 1M uploads differ from our normal GGUFs is that we use very long examples in our calibration dataset.
In general, if you're not using context lengths over 256K, we'd recommend using the standard one.
CC: @CHNtentes @mallorbc @owao @auggie246 @rboehme86 @ijohn07