Q4_0 4-bit only scored 50.71 on MMLU Pro

#3
by xceptor - opened

Unlike the non-thinking model, which scored 0.7 even at Q2 (2-bit), this model is performing worse.

logs

+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Model                   | Dataset   | Metric          | Subset           |   Num |   Score | Cat.0   |
+=========================+===========+=================+==================+=======+=========+=========+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | computer science |    10 |  0.4    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | math             |    10 |  0.6    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | chemistry        |    10 |  0.8    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | engineering      |    10 |  0.4    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | law              |    10 |  0.2    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | biology          |    10 |  0.8    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | health           |    10 |  0.5    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | physics          |    10 |  0.6    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | business         |    10 |  0.4    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | philosophy       |    10 |  0.3    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | economics        |    10 |  0.7    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | other            |    10 |  0.5    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | psychology       |    10 |  0.7    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | history          |    10 |  0.2    | default |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Qwen_think_unsloth_4bit | mmlu_pro  | AverageAccuracy | OVERALL          |   140 |  0.5071 | -       |
+-------------------------+-----------+-----------------+------------------+-------+---------+---------+

Would be interesting to see how a 4-bit grouped quantization like Q4_K_M, or even Unsloth's dynamic scheme UD-Q4_K_XL, performs...

Will give it a try. (These thinking models are yappers; they take so much time to run and consume a lot of unnecessary tokens.)

Best to try UD-Q4_K_XL - also, I just updated the model!

FYI, Q4_0 quants aren't dynamic, in case you aren't aware! Only quants with UD are!
So best to try UD-Q4_K_XL.

Yes, let me try both of them again. Thank you.

Unsloth AI org

:)

Yes, let me try both of them again.

Thank you -- that would hopefully clearly establish for this MoE model that Q4_0 < Q4_K_M < UD-Q4_K_XL, and hopefully at least UD-Q4_K_XL comes close to the performance of the native weights.

These thinking models are yappers; they take so much time to run [..]

Yeah, that's no exaggeration. I started the MMLU-Pro run (https://github.com/chigkim/Ollama-MMLU-Pro) on a 6-bit quant, but I am not going to be able to complete it with my HW :/ Lower quants might yap even more, possibly also exceeding a limited context length on some tasks more often than the native/full-precision weights do.

For high-profile models (incl. the upcoming smaller Qwen3-Coders) it would be really nice to have a subset of key benchmarks for a subset of the quants... just in case someone GPU-rich can easily do this "for free". In general, I go for the largest quant my HW can still run, but maybe a smaller quant (especially a UD one) could deliver similar performance while needing far fewer resources (power and memory).

FYI, Q4_0 quants aren't dynamic, in case you aren't aware! Only quants with UD are! @xceptor @lightenup

Thanks, I am aware :) Nevertheless, at least for me it would be interesting to confirm in terms of actual model performance (MMLU Pro, HumanEval, ...) how much better your quants are compared to simple grouped quants (Q4_K_M). The static one (Q4_0) would just be a nice baseline - maybe encouraging the community to move on: not download/use them, maybe not even generate/distribute them anymore. (Apologies in case I am ignorant of situations where someone depends on static quants.)

Got results after re-downloading the 4-bit Q4_0 model (testing Q4_K_XL now).

It scored 50.00.
One thing to notice: it's not bad at reasoning, I guess - look at math, physics, chemistry (7, 7, 8 out of 10); it's law, history & philosophy where it scored lower.

+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| Model                                         | Dataset   | Metric          | Subset           |   Num |   Score | Cat.0   |
+===============================================+===========+=================+==================+=======+=========+=========+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | computer science |    10 |     0.4 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | math             |    10 |     0.7 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | chemistry        |    10 |     0.8 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | engineering      |    10 |     0.5 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | law              |    10 |     0.1 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | biology          |    10 |     0.7 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | health           |    10 |     0.4 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | physics          |    10 |     0.7 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | business         |    10 |     0.4 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | philosophy       |    10 |     0.2 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | economics        |    10 |     0.8 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | other            |    10 |     0.5 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | psychology       |    10 |     0.6 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | history          |    10 |     0.2 | default |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+
| unsloth_Qwen3-30B-A3B-Thinking-2507-Q4_0.gguf | mmlu_pro  | AverageAccuracy | OVERALL          |   140 |     0.5 | -       |
+-----------------------------------------------+-----------+-----------------+------------------+-------+---------+---------+

Hm, that's basically the same result. Are you using llama.cpp? (Because according to https://huggingface.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF/discussions/4 the original gguf files already worked correctly with llama.cpp.) How large is your context? If you have detailed benchmark logs, you could check the long responses to see whether the model ran out of context before providing the final answer.
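Something like this rough sketch could flag such cases; it assumes the benchmark tool writes one JSON record per question with the raw model output in a "response" field (the file and field names are just placeholders - adapt them to your tool's actual log format):

```python
import json
import re

missing = 0
total = 0
with open("benchmark_log.jsonl") as f:          # hypothetical log file name
    for line in f:
        rec = json.loads(line)
        text = rec.get("response", "")          # raw model output; field name assumed
        total += 1
        # MMLU-Pro graders usually extract a final "answer is (X)" with X in A..J;
        # if no such letter shows up, the response likely ran out of context mid-reasoning.
        if not re.search(r"answer is \(?[A-J]\)?", text, re.IGNORECASE):
            missing += 1
print(f"{missing} of {total} responses have no extractable answer letter")
```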

Other than the context length being too small, the static quants of this model might just be very bad :) Looking forward to your current run regarding Q4_K_XL!
/edited: improved clarity

Yes, llama.cpp, but I don't think that is the issue, because Gemma models are working well; check them out here.

mmlu-pro

Yes, llama.cpp, but I don't think that is the issue

No issue - it would just explain why you got basically the same result for Q4_0 (0.5071 from your first post vs. 0.5 now) despite re-downloading the model and redoing the test: llama.cpp already did the right thing, even with the previous version of the gguf model file.

btw: I started on UD-Q3_K_XL, and based on the first 28 biology questions (of 717 in total) I have an accuracy of 0.89 (vs. 0.7 from your results) -> a good first hint that dynamic UD quants, even at 1 bit lower, outperform static quants.

Got results for Q4_K_XL; it performed worse than Q4_0.

Score: 0.4786

Even though it's a couple of % lower, it is still about the same as Q4_0... something must be wrong.

What context length (parameter -c) do you set in llama.cpp? As you noticed, a lot of tokens are produced, and Qwen themselves suggest a minimum context length of 32,768 tokens (and for complex tasks even 81,920 tokens). Maybe you have set it too low, and for many of the more complex questions the model never arrives at a final answer? (Then the question is evaluated as wrong, and this ofc lowers the result.)
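For reference, a minimal sketch of loading the quant with a sufficiently large context and the recommended thinking-mode sampling settings; this assumes the llama-cpp-python bindings rather than the llama.cpp CLI, and the model path is just a placeholder:

```python
from llama_cpp import Llama  # llama-cpp-python bindings (the CLI equivalents are -c / -ngl)

llm = Llama(
    model_path="Qwen3-30B-A3B-Thinking-2507-UD-Q4_K_XL.gguf",  # placeholder path
    n_ctx=32768,        # Qwen suggests at least 32768 tokens; more for complex tasks
    n_gpu_layers=-1,    # offload all layers to the GPU if it fits
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "A ball is dropped from 20 m. How long until it hits the ground?"}],
    temperature=0.6,    # recommended sampling settings for the thinking model
    top_p=0.95,
    top_k=20,
    min_p=0.0,
    max_tokens=16384,   # leave plenty of room for the reasoning trace plus the final answer
)
print(out["choices"][0]["message"]["content"])
```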

I use 60k as context length.

In case you think something is wrong with my setup: I tested the recent Horizon_alpha, which scored 84.29.
I also tested 2-bit non-thinking models; surprisingly, they scored 70+. Check them out here.

Let's see - I am trying the model unsloth/cogito-v2-preview-llama-109B-MoE-GGUF; Cogito has published MMLU Pro benchmarks on its website.

There shouldn't be too much deviation. Otherwise, I will ask @daniel to check it himself.

If the benchmark is being run correctly, there might be a problem with the quants.

To me it's strange that you end up around 0.5 for both the 4-bit static and the 4-bit dynamic quants; that's IMO too big a drop from the native full-precision weights (0.81). I would have expected a drop of up to about 10% for static 4-bit quants and only up to a couple of % for dynamic 4-bit quants.

Just looking for reasons why the benchmark run might be failing the same way for both quants and giving a wrong result:

  • You probably use the recommended model parameters, https://docs.unsloth.ai/basics/qwen3-2507#thinking-qwen3-30b-a3b-thinking-2507

  • 60k context is ofc huge, but it could still be too small for the tougher questions. Not sure which benchmark tool you use, but the detailed logs should contain the model's answer for each prompt. If you notice that an answer is cut off (not giving the letter answer the benchmark eval tool expects), the context length is too small.

  • Am I reading this right: you select just 10 questions from each category and benchmark the model on 140 questions overall (column 'Num' in your results)? -> If yes, then this is a tiny subset (less than 2%) of the whole benchmark, and I'm not sure these are enough questions to estimate the full MMLU-Pro result; see the rough error estimate below.
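As a back-of-the-envelope check of that last point (my own numbers, standard binomial error, not something the benchmark tool reports):

```python
import math

n, p = 140, 0.5                                # sampled questions, observed accuracy
se = math.sqrt(p * (1 - p) / n)                # binomial standard error
print(f"95% CI: {p:.2f} +/- {1.96 * se:.2f}")  # roughly +/- 0.08
# Per category it is far noisier: with n = 10, a single question moves the score by 0.10.
```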

0.81? Where did you get this number from?

I am running MMLU-Pro (not just MMLU), and not 5-shot but single-shot.

just FYI.

0.81? Where did you get this number from?

From the model card: https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507#performance

not 5-shot but single-shot.

Ah, ok - I assumed you did the standard MMLU-Pro benchmark, 5-shot. Anyway, a drop from 0.81 for 5-shot to 0.5 for 0-shot is IMO still much larger than can be explained by the 4-bit quantization. Also, the fact that you got essentially the same result for both the static and the dynamic quant (the dynamic quant should be much better than the static one) IMO points to an issue.

Let me try 5-shot one more time with 131k context length. If that doesn't work, then the issue is something else.

(Boy, these models are yappers.)
