How to run the 128k models
Great that you have created the 128k models, but I don't understand what's required to run them.
E.g. looking at your doc for llama.cpp:
https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html
It mentions that you need the following parameters for 128k context:
-c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
Are all of these included by default in the 128k models? E.g. what is the minimum number of arguments I need to run with 128k context using the 128k models?
@rogerooberg I was confused too, but after a quick discussion with AaronFeng47 on reddit I think I understand better.
You can check the GGUF KV metadata (e.g. in the model card sidebar) to see whether unsloth baked in the three values, compared to a normal quant.
tl;dr: this "128k" model seems to just have a few KV metadata entries updated with llama.cpp's gguf-py/gguf/scripts/gguf_new_metadata.py, because LM Studio doesn't work with those command-line arguments.
If you're on llama.cpp or other inference engines that accept additional arguments you can use a "normal" model and enable this 128k mode.
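If you want to check from the command line instead of the sidebar, something like this should work (a sketch, assuming the gguf_dump.py script that ships alongside gguf_new_metadata.py in the llama.cpp repo, and a placeholder .gguf filename; exact paths and flags may vary by version):
# dump only the KV metadata and look for the context-length / RoPE-scaling keys
python llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --no-tensors Qwen3-30B-A3B-128K-Q8_0.gguf | grep -iE "context_length|rope"
If the three values were baked in, the rope scaling type/factor and original context length should show up here; on a normal quant you'd pass them as command-line arguments instead.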
@rogerooberg The 128K GGUFs have to be downloaded separately from the normal quants, and they're genuinely different. Using a normal quant from other providers, like what @ubergarm said, isn't correct.
First, the context length for Qwen 3 is not 32K, it's 40960 - we verified this with the Qwen team. I.e. any quant using a 32K context size is actually wrong. We communicated this with the Qwen team during their pre-release and helped resolve issues.
Second, yes, enabling YaRN like that is fine, but you MUST calibrate the imatrix (importance matrix) to account for longer sequence lengths - i.e. your own importance plots show some differences from our importance matrix, since we used 12K context lengths. Yes, that's less than 32K, but 12K is much better than 512.
YaRN scales the RoPE embeddings, so doing imatrix on 512-token sequence lengths will not be equivalent to doing imatrix on 12K context lengths - note https://blog.eleuther.ai/yarn/ shows shorter contexts degrade in accuracy, so you can't simply set YaRN and expect the same performance on quantized models. Only BF16 can get away with just switching YaRN on like that.
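For reference, generating an importance matrix at a longer sequence length looks roughly like this (a sketch, assuming llama.cpp's llama-imatrix and llama-quantize tools and placeholder file names; exact flags can vary by version):
# build the imatrix on ~12K-token chunks instead of the default 512
./llama-imatrix -m Qwen3-30B-A3B-BF16.gguf -f calibration_data.txt -c 12288 -ngl 99 -o imatrix-12k.dat
# then feed it to the quantizer
./llama-quantize --imatrix imatrix-12k.dat Qwen3-30B-A3B-BF16.gguf Qwen3-30B-A3B-128K-Q4_K_M.gguf Q4_K_M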
Conclusion: you have to download these 128K quants, which are specifically optimized for longer context lengths.
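To answer the original question about minimum arguments: if, as noted above, the 128K GGUFs already carry the YaRN rope-scaling values in their metadata, a minimal run shouldn't need the extra --rope-* flags - a sketch, assuming the unsloth/Qwen3-30B-A3B-128K-GGUF repo name and the Q8_0 quant:
./llama-cli -hf unsloth/Qwen3-30B-A3B-128K-GGUF:Q8_0 --jinja -ngl 99 -fa --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 131072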
@danielhanchen Thanks for your post, but could you clarify (or point to) how to use this model with LM Studio? I.e., do I need to make any changes to the model config after downloading?
Thanks - I'll reply once here to keep it simple:
https://www.reddit.com/r/LocalLLaMA/comments/1kju1y1/comment/mru9ob7/
Thanks all for helping out!
@danielhanchen
I am still a bit confused, but I am new to this area, so forgive me if I am asking overly obvious questions :)
To be more specific, looking at your example for a regular model using llama.cpp, it says:
./llama-cli -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift
And if you want to run that model with 128k context, you should add the YaRN parameters. E.g. in this example it should be executed as follows (if you want the other parameters to stay the same):
./llama-cli -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 131072 -n 32768 --no-context-shift --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
Now for your 128k models: besides the fact that they need to be downloaded separately, what would the corresponding command be?
E.g. is this correct?
./llama-cli -hf Qwen/Qwen3-30B-A3B-128K-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 131072 -n 32768 --no-context-shift --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
- Regarding your points: is that something I need to configure/tweak, or is that just the explanation of what you have already done in the 128k models you provide?