Performance of the dynamic quants compared to usual quants?

#21
by inputout - opened

Thanks first of all. I read https://unsloth.ai/blog/deepseekr1-dynamic, and dynamic quants are a great development: they show a fantastic ability to retain usable performance despite very aggressive quantization (demonstrated with the Flappy Bird game).
But I do not understand how the dynamic quants rank in terms of performance compared to the usual quants, e.g. @bartowski's IQ4_XS, IQ3_M, IQ3_S, IQ3_XXS, IQ2_M... (https://huggingface.co/bartowski/DeepSeek-R1-GGUF).
ADDED NOTE: By "performance" I mean how well the LLM gives correct/intelligent answers (like the Arena Score on the chatbot-arena-leaderboard) and not the speed of execution.
I can't find any benchmarks for a real comparison (the leaderboards only show unquantized performance).
Can, for example, the dynamic quant Q2_K_XL achieve the performance of an IQ4_XS?
If you were to draw up a kind of ranking list, how would you roughly categorize the new quants in comparison to the usual quants?
It would be great to know which dynamic quant corresponds to which usual quant in terms of performance. For example:

  • 212GB Q2_K_XL, 2.51-bit (MoE), 3.5/2.5-bit (Down_proj): corresponds approximately to: IQ4_XS, IQ3_M, IQ3_S, IQ3_XXS, IQ2_M... ??
  • 183GB IQ2_XXS, 2.22-bit (MoE), 2.5/2.06-bit (Down_proj): corresponds approximately to: IQ4_XS, IQ3_M, IQ3_S, IQ3_XXS, IQ2_M... ??

In my understanding, the unsloth dynamic quants use the same quant types as bartowski and others; they just decide per layer and per matrix which type to use, and decide differently than the default GGUF quantization code.
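A minimal sketch of that idea (the tensor names, thresholds and type choices here are hypothetical, not unsloth's actual selection code):

```python
# Sketch only: pick a GGUF quant type per tensor based on its role,
# instead of one type for the whole model. Names/choices are hypothetical.
def pick_quant_type(tensor_name: str) -> str:
    """Choose a quant type for one tensor based on its name/role."""
    if "attn" in tensor_name:            # attention matrices: keep higher precision
        return "Q6_K"
    if "shared_expert" in tensor_name:   # shared experts: small but sensitive
        return "Q6_K"
    if "down_proj" in tensor_name:       # down projections: error-sensitive
        return "Q4_K"
    return "IQ1_S"                       # bulk of routed MoE expert weights

# Hypothetical tensor names, for illustration:
for name in ["layers.10.attn_q.weight",
             "layers.10.mlp.down_proj.weight",
             "layers.10.mlp.experts.0.up_proj.weight"]:
    print(name, "->", pick_quant_type(name))
```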

This graph will give a rough idea: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

Exactly. I know it is very useful for the usual quants, but it does not show where the four unsloth dynamic quants are located.

> In my understanding, the unsloth dynamic quants use the same quant types as bartowski and others; they just decide per layer and per matrix which type to use, and decide differently than the default GGUF quantization code.

Yes, it is a mixture of different bit widths; that much is basically clear. The site provides this information (https://unsloth.ai/blog/deepseekr1-dynamic):

  • The first 3 dense layers use 0.5% of all weights; these are left at 4 or 6 bit.
  • The MoE layers' shared experts use 1.5% of the weights; these get 6 bit.
  • The MLA attention modules are kept at 4 or 6 bit, using <5% of the weights.
  • That leaves ~88% of the weights, which can be shrunk massively.

1.58-bit 131GB IQ1_S: Range 1.58 to 4/6 bit
1.73-bit 158GB IQ1_M: Range 1.73 to 4/6 bit
2.22-bit 183GB IQ2_XXS: Range 2.22 to 4/6 bit
2.51-bit 212GB Q2_K_XL: Range 2.51 to 4/6 bit
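As a rough sanity check, one can estimate the effective average bits per weight implied by the file sizes above, assuming DeepSeek-R1's ~671B total parameters (GB-vs-GiB rounding and file metadata make this approximate):

```python
# Back-of-envelope: effective bits/weight = file size in bits / parameter count.
# Assumes ~671B total parameters and decimal GB; treat results as approximate.
params = 671e9
sizes_gb = {"IQ1_S": 131, "IQ1_M": 158, "IQ2_XXS": 183, "Q2_K_XL": 212}

for name, gb in sizes_gb.items():
    bpw = gb * 8e9 / params
    print(f"{name}: ~{bpw:.2f} bits/weight on average")
```

The exact values depend on GB vs. GiB, but they show the averages stay close to each file's low-bit floor rather than anywhere near 4-6 bit.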

By my logic it is therefore completely unclear where in the large range (1.58/1.73/2.22/2.51 up to 4/6 bit) the dynamic quants stand in terms of performance. Judging by the bit counts alone doesn't seem possible, because they are fluid.
It would therefore be great to know approximately which dynamic quant corresponds to which usual quant in terms of performance. For me, as a first step it would even be enough to know Q2_K_XL vs. IQ4_XS (similar performance, worse, better?). Hopefully someone with suitable hardware will run performance benchmarks so that the four unsloth dynamic quants can be roughly placed in context with the usual quants.

I have an ongoing project to compute perplexity for the 4 low-bit dynamic quant versions and compare them to Q8 or FP8.
It'll take time and money, since I don't own hardware that can do that and will have to rent a large cloud instance.
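For reference, perplexity is just the exponential of the average negative log-likelihood per token, computed over a text corpus (llama.cpp ships a perplexity tool for exactly this). A minimal sketch of the formula:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the average negative log-likelihood per token (lower is better)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy example with made-up log-probabilities (natural log):
print(perplexity([-1.2, -0.7, -2.3, -0.9]))  # ~3.58
```

A quant that tracks the Q8/FP8 baseline closely should show only a small relative perplexity increase.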

I've read that the dynamic Q2s run significantly faster than the dynamic Q1s on the same machine.

It's a bit of a moot point, and I won't be testing it by downloading a Q1, because I can run Q2_K_XL, so why sacrifice quality?

By "performance" i mean how well the LLM gives correct/intelligent answers. Sorry maybe i should have worded it more clearly.
So the question was about the "intelligence" of the LLMs (like Arena Score in chatbot-arena-leaderboard) and not the speed of execution.
I am not a native speaker, sorry maybe "performance" was misleading, is there technically unmistakably better word for that??
(Maybe ist "perplexity" or "accuracy" better?, i could modify the headline)

By "performance" i mean how well the LLM gives correct/intelligent answers. Sorry maybe i should have worded it more clearly.
So the question was about the "intelligence" of the LLMs (like Arena Score in chatbot-arena-leaderboard) and not the speed of execution.
I am not a native speaker, sorry maybe "performance" was misleading, is there technically unmistakably better word for that??
(Maybe ist "perplexity" or "accuracy" better?, i could modify the headline)

Oh right I understand now! Yes, this is definitely important to work out.

> I have an ongoing project to compute perplexity for the 4 low-bit dynamic quant versions and compare them to Q8 or FP8.
> It'll take time and money, since I don't own hardware that can do that and will have to rent a large cloud instance.

I'm not so familiar with this: is the perplexity metric suitable for this type of model, with its MoE architecture and reasoning/CoT? On the other hand, getting the deviation from Q8/FP8 is very valuable either way. It becomes particularly interesting when the deviation of the dynamic quants is then compared with the deviation of the usual quants (IQ4_XS, IQ3_M, ...), so one can roughly estimate how much “intelligence” remains and where they sit among the usual quants.
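One way to do that comparison once measurements exist: rank all the quants by their relative perplexity increase over the Q8/FP8 baseline. A sketch (every number below is an invented placeholder, not a measurement):

```python
# Placeholder values only; replace with real measured perplexities.
baseline_ppl = 3.50  # hypothetical Q8/FP8 baseline
measured = {         # hypothetical quant results
    "Q2_K_XL (dynamic)": 3.80,
    "IQ4_XS": 3.65,
    "IQ2_M": 4.20,
}

for name, ppl in sorted(measured.items(), key=lambda kv: kv[1]):
    delta = (ppl / baseline_ppl - 1.0) * 100.0
    print(f"{name}: +{delta:.1f}% PPL vs baseline")
```

Whichever quant shows the smaller relative increase would sit higher in the "intelligence" ranking asked about above.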
I found this: https://www.reddit.com/r/LocalLLaMA/comments/1idi5cr/i_did_a_very_short_perplexity_test_with_deepseek/ (But I can't interpret it).
