128k Context GGUF, please?

#2
by MikeNate - opened

Hi Bartowski,

I'm new to the scene but am enjoying your work. Thank you for your contributions to the community in pushing out all these quants so quickly!

I was wondering if you will also be creating GGUFs with 128k context using YaRN? I tried the Unsloth ones, but they keep causing my Ollama instance to crash when I push past 30k context. Yours seem to work more stably for me.

Thank you so much!

The crash might have had a different cause, but these GGUFs' metadata does indeed show

llama_model_loader: - kv  13:                    qwen3moe.context_length u32              = 32768
print_info: n_ctx_train      = 32768

Note that running with scaling enabled is believed to degrade performance on short-context tasks. For long-context tasks it is better to pass parameters like this:

llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
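
For reference, a full invocation might look like the sketch below; the model filename and the 128k context size (4 x the 32768 trained context) are illustrative assumptions, not specifics from this repo:

# hypothetical GGUF filename; 131072 = 4 x 32768 (n_ctx_train)
llama-server -m ./Qwen3-30B-A3B-Q4_K_M.gguf --ctx-size 131072 \
  --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768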

What he said ^

Making the 128k GGUFs doesn't make much sense imo, since the runtime args do the same thing without having to download entirely new models

Thanks for the write-up @KeyboardMasher, I may update my readme so that info is more central

I guess I didn't know how to add the runtime arguments for Ollama and thought the only way was to have the config.json modified when the GGUF is created (as with the Unsloth 128k models).
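
For anyone in the same spot, a minimal Ollama Modelfile sketch is below. The GGUF filename is hypothetical, and as far as I know only the context window is exposed this way; the YaRN rope-scaling settings still come from the GGUF metadata, which is part of why the pre-scaled 128k GGUFs exist for Ollama users:

# Modelfile (hypothetical filename; num_ctx only raises the context Ollama allocates)
FROM ./Qwen3-30B-A3B-Q4_K_M.gguf
PARAMETER num_ctx 131072

Build and run it with ollama create qwen3-128k -f Modelfile followed by ollama run qwen3-128k.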

I'm going to try out llama.cpp so this might be moot.
