How to run the 128k models
Great that you have created the 128k models, but I don't understand what's required to run them.
E.g. looking at your doc for llama.cpp:
https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html
It mentions that you need the following parameters for 128k context:
-c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
Are all of these included by default in the 128k models? E.g. what is the minimum number of arguments I need to run with 128k context using the 128k models?
@rogerooberg I was confused too, but after a quick discussion with AaronFeng47 on reddit I think I understand better.
You can check the GGUF KV metadata (e.g. in the model card sidebar) to see whether unsloth baked in the three values, compared to a normal quant.
tl;dr: this "128k" model seems to just have a few KV metadata entries updated with llama.cpp's gguf-py/gguf/scripts/gguf_new_metadata.py, because LM Studio doesn't work with those command-line arguments.
If you're on llama.cpp or other inference engines that accept additional arguments you can use a "normal" model and enable this 128k mode.
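If you want to check from the command line instead of the sidebar, something like this should work (a sketch, assuming the gguf_dump.py script that ships alongside gguf_new_metadata.py in the llama.cpp repo, and a placeholder .gguf filename; exact paths and flags may vary by version):
# dump only the KV metadata and look for the context-length / RoPE-scaling keys
python llama.cpp/gguf-py/gguf/scripts/gguf_dump.py --no-tensors Qwen3-30B-A3B-128K-Q8_0.gguf | grep -iE "context_length|rope"
If the three values were baked in, the rope scaling type/factor and original context length should show up here; on a normal quant you'd pass them as command-line arguments instead.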
@rogerooberg The 128K GGUFs have to be downloaded separately from the normal quants, and they're genuinely different. Using a normal quant from other providers, like what @ubergarm said, isn't correct.
First, the context length for Qwen 3 is not 32K, it's 40960 - we verified this with the Qwen team. I.e. any quant using a 32K context size is actually wrong. We communicated this with the Qwen team during their pre-release and helped resolve issues.
Second, yes, enabling YaRN like that is fine, but you MUST calibrate the imatrix (importance matrix) to account for longer sequence lengths - i.e. your own importance plots show some differences from our importance matrix, since we used 12K context lengths. Yes, that's less than 32K, but 12K is much better than 512.
YaRN scales the RoPE embeddings, so doing imatrix on 512-token sequence lengths will not be equivalent to doing imatrix on 12K context lengths - note https://blog.eleuther.ai/yarn/ shows shorter contexts degrade in accuracy, so you can't simply set YaRN and expect the same performance on quantized models. Only BF16 can get away with just switching YaRN on like that.
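For reference, generating an importance matrix at a longer sequence length looks roughly like this (a sketch, assuming llama.cpp's llama-imatrix and llama-quantize tools and placeholder file names; exact flags can vary by version):
# build the imatrix on ~12K-token chunks instead of the default 512
./llama-imatrix -m Qwen3-30B-A3B-BF16.gguf -f calibration_data.txt -c 12288 -ngl 99 -o imatrix-12k.dat
# then feed it to the quantizer
./llama-quantize --imatrix imatrix-12k.dat Qwen3-30B-A3B-BF16.gguf Qwen3-30B-A3B-128K-Q4_K_M.gguf Q4_K_M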
Conclusion: you have to download these 128K quants, which are specifically optimized for longer context lengths.
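To answer the original question about minimum arguments: if, as noted above, the 128K GGUFs already carry the YaRN rope-scaling values in their metadata, a minimal run shouldn't need the extra --rope-* flags - a sketch, assuming the unsloth/Qwen3-30B-A3B-128K-GGUF repo name and the Q8_0 quant:
./llama-cli -hf unsloth/Qwen3-30B-A3B-128K-GGUF:Q8_0 --jinja -ngl 99 -fa --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 131072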
@danielhanchen Thanks for your post, but could you clarify (or point to) how to use this model with LM Studio? I.e., do I need to make any changes to the model config after downloading?
Thanks - I'll reply once here to keep it simple:
https://www.reddit.com/r/LocalLLaMA/comments/1kju1y1/comment/mru9ob7/
Thanks all for helping out!
@danielhanchen
I am still a bit confused, but I am new to this area, so forgive me if I am asking overly obvious questions :)
To be more specific, looking at your example for a regular model using llama.cpp, it says:
./llama-cli -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift
And if you want to run that model with 128k context, you should add the YaRN parameters. E.g. in this example it should be executed as follows (if you want the other parameters to stay the same):
./llama-cli -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 131072 -n 32768 --no-context-shift --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
Now for your 128k models: besides the fact that they need to be downloaded separately, what would the corresponding command be?
E.g. is this correct?
./llama-cli -hf Qwen/Qwen3-30B-A3B-128K-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 131072 -n 32768 --no-context-shift --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
- Regarding your points: is that something I need to configure/tweak, or is that just the explanation of what you have already done in the 128k models you provide?