Why a different Jinja template?
After I noticed different performance across different quants, I compared this Jinja template with the one the Unsloth quants ship, and there is a difference.
The chat template would have been taken automatically from the original tokenizer_config.json at https://huggingface.co/Qwen/QwQ-32B . I will look into Unsloth's bug fixes and see if it's worth redoing.
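For anyone who wants to check this themselves, here is a rough sketch for diffing the template baked into a GGUF against the one in tokenizer_config.json. It assumes the gguf Python package that ships with llama.cpp and that the convert script wrote the template into the tokenizer.chat_template metadata field; the file paths are just placeholders.

```python
import json
import difflib

from gguf import GGUFReader  # pip install gguf

# Placeholder paths -- point these at your own files
GGUF_PATH = "QwQ-32B-Q3_K_M.gguf"
TOKENIZER_CONFIG = "tokenizer_config.json"  # downloaded from Qwen/QwQ-32B

# The chat template is stored as a string field in the GGUF metadata
reader = GGUFReader(GGUF_PATH)
field = reader.fields["tokenizer.chat_template"]
gguf_template = bytes(field.parts[-1]).decode("utf-8")

# The original template lives in tokenizer_config.json on the HF repo
with open(TOKENIZER_CONFIG, encoding="utf-8") as f:
    hf_template = json.load(f)["chat_template"]

# Line-by-line diff of the two Jinja templates
diff = difflib.unified_diff(
    hf_template.splitlines(),
    gguf_template.splitlines(),
    fromfile="tokenizer_config.json",
    tofile="gguf metadata",
    lineterm="",
)
print("\n".join(diff) or "templates are identical")
```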
I tried using the Unsloth template with your quant (q3_k_m), but the performance is still different for some reason. In LM Studio, asking the default question "What is the capital of France?" makes it think a bit too much, while Unsloth's q3_k_m has a shorter thinking time. All settings are exactly the same: temp 0.6, min_p 0.01, top_p 0.95, top_k 40, rep_p turned off. I don't know, maybe it's supposed to be like this with different quants. Somehow I always end up going back to Unsloth. I also tried bartowski quants before... but somehow same story.
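To rule out LM Studio itself, this is roughly how I'd reproduce the same sampler settings with llama-cpp-python (the model path and context size are placeholders, and by default it should pick up whatever chat template is embedded in the GGUF):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path -- swap in whichever quant you are comparing
llm = Llama(model_path="QwQ-32B-Q3_K_M.gguf", n_ctx=8192)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.6,
    top_p=0.95,
    top_k=40,
    min_p=0.01,
    repeat_penalty=1.0,  # 1.0 disables the repetition penalty
    max_tokens=2048,
)
print(out["choices"][0]["message"]["content"])
```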
Interesting that Unsloth's model seems to perform better. Looking at the quants in detail, there are two differences, the embeddings and the output tensors; the rest are the same. Unsloth's token_embd.weight is Q3_K, while my model's token_embd.weight is Q5_K. I always use higher than the default settings in llama-quantize for the embeddings and output tensors, as I found it makes a big difference and adds a relatively small amount to model size. I think Unsloth would agree that a higher quant on the embedding tensor is better. I have noticed their tendency to set higher embedding quants, which would explain why in general Unsloth is better than bartowski.

There is also a difference in the output weights: Unsloth uses Q6_K rather than the Q5_K I use, so that is probably why you see the difference. Given that going from Q5_K to Q6_K does not add a lot to model size, it is worth considering for all my model quantization moving forward. I use Q6_K on q4_k_m and q4_k_s, and Q8_0 for q4_k_l. It will take a while, but I will post higher output-tensor quants for the 3-bit models.

I have not decided about the chat template. Why did Qwen and DeepSeek force a <think> token? I am not sure whether taking it out will have unintended consequences.
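For anyone who wants to run the same comparison, a rough sketch of dumping the per-tensor quant types from two GGUFs and printing only the tensors that differ (again assuming the gguf Python package; the file names are placeholders):

```python
from gguf import GGUFReader  # pip install gguf

# Placeholder file names for the two quants being compared
FILES = {
    "mine":    "QwQ-32B-Q3_K_M.gguf",
    "unsloth": "QwQ-32B-Q3_K_M-unsloth.gguf",
}

# Map tensor name -> quant type (e.g. "Q3_K", "Q5_K") for each file
types = {
    label: {t.name: t.tensor_type.name for t in GGUFReader(path).tensors}
    for label, path in FILES.items()
}

# Print only the tensors whose quant type differs between the two files
for name in sorted(types["mine"]):
    a = types["mine"].get(name)
    b = types["unsloth"].get(name)
    if a != b:
        print(f"{name}: mine={a} unsloth={b}")
```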
q3_k_m updated with Q6_K embeddings and output tensors. Template left as the original.
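For reference, a rough sketch of how the tensor-type overrides are passed to llama-quantize when requantizing (the binary location, the source F16 GGUF, and the output name are all placeholders):

```python
import subprocess

# Requantize with higher-precision embedding and output tensors.
subprocess.run(
    [
        "./llama-quantize",
        "--token-embedding-type", "q6_K",  # bump token_embd.weight to Q6_K
        "--output-tensor-type", "q6_K",    # bump output.weight to Q6_K
        "QwQ-32B-F16.gguf",                # source (unquantized) model
        "QwQ-32B-Q3_K_M.gguf",             # destination file
        "Q3_K_M",                          # base quant type for the remaining tensors
    ],
    check=True,
)
```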