Performance of the dynamic quants compared to usual quants?

#21
by inputout - opened

Thanks first of all. I read https://unsloth.ai/blog/deepseekr1-dynamic, and dynamic quants are a great development: they show a fantastic ability to retain usable performance despite very aggressive quantization (demonstrated with the Flappy Bird game).
But I do not understand how the dynamic quants rank in terms of performance compared to the usual quants, e.g. @bartowski's IQ4_XS, IQ3_M, IQ3_S, IQ3_XXS, IQ2_M... (https://huggingface.co/bartowski/DeepSeek-R1-GGUF).
ADDED NOTE: By "performance" I mean how well the LLM gives correct/intelligent answers (like the Arena Score on the chatbot-arena-leaderboard) and not the speed of execution.
I can't find any benchmarks for a real comparison (the leaderboards only show unquantized performance).
Can, for example, the dynamic quant Q2_K_XL achieve the performance of an IQ4_XS?
If you were to draw up a kind of ranking list, how would you roughly categorize the new quants in comparison to the usual quants?
It would be great to know which dynamic quant corresponds to which usual quant in terms of performance. For example:

  • 212GB Q2_K_XL, 2.51-bit (MoE), 3.5/2.5-bit (Down_proj): corresponds approximately to: IQ4_XS, IQ3_M, IQ3_S, IQ3_XXS, IQ2_M... ??
  • 183GB IQ2_XXS, 2.22-bit (MoE), 2.5/2.06-bit (Down_proj): corresponds approximately to: IQ4_XS, IQ3_M, IQ3_S, IQ3_XXS, IQ2_M... ??

In my understanding, the unsloth dynamic quants use the same quant types as bartowski and others; they just decide per layer and per matrix which type to use, and decide differently than the default GGUF quantization code.
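A minimal sketch of that idea (the tensor names, thresholds and type choices here are hypothetical, not unsloth's actual selection code):

```python
# Sketch only: pick a GGUF quant type per tensor based on its role,
# instead of one type for the whole model. Names/choices are hypothetical.
def pick_quant_type(tensor_name: str) -> str:
    """Choose a quant type for one tensor based on its name/role."""
    if "attn" in tensor_name:            # attention matrices: keep higher precision
        return "Q6_K"
    if "shared_expert" in tensor_name:   # shared experts: small but sensitive
        return "Q6_K"
    if "down_proj" in tensor_name:       # down projections: error-sensitive
        return "Q4_K"
    return "IQ1_S"                       # bulk of routed MoE expert weights

# Hypothetical tensor names, for illustration:
for name in ["layers.10.attn_q.weight",
             "layers.10.mlp.down_proj.weight",
             "layers.10.mlp.experts.0.up_proj.weight"]:
    print(name, "->", pick_quant_type(name))
```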

This graph will give a rough idea: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

Exactly. I know it is very useful for the usual quants, but it does not show where the four unsloth dynamic quants are located.

> In my understanding, the unsloth dynamic quants use the same quant types as bartowski and others; they just decide per layer and per matrix which type to use, and decide differently than the default GGUF quantization code.

Yes, it is a mixture of different bit widths; that much is basically clear. The site provides this information (https://unsloth.ai/blog/deepseekr1-dynamic):

  • The first 3 dense layers use 0.5% of all weights; these are left at 4 or 6 bit.
  • The MoE layers' shared experts use 1.5% of the weights; these get 6 bit.
  • The MLA attention modules are kept at 4 or 6 bit, using <5% of the weights.
  • That leaves ~88% of the weights, which can be shrunk massively.

1.58-bit 131GB IQ1_S: Range 1.58 to 4/6 bit
1.73-bit 158GB IQ1_M: Range 1.73 to 4/6 bit
2.22-bit 183GB IQ2_XXS: Range 2.22 to 4/6 bit
2.51-bit 212GB Q2_K_XL: Range 2.51 to 4/6 bit
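As a rough sanity check, one can estimate the effective average bits per weight implied by the file sizes above, assuming DeepSeek-R1's ~671B total parameters (GB-vs-GiB rounding and file metadata make this approximate):

```python
# Back-of-envelope: effective bits/weight = file size in bits / parameter count.
# Assumes ~671B total parameters and decimal GB; treat results as approximate.
params = 671e9
sizes_gb = {"IQ1_S": 131, "IQ1_M": 158, "IQ2_XXS": 183, "Q2_K_XL": 212}

for name, gb in sizes_gb.items():
    bpw = gb * 8e9 / params
    print(f"{name}: ~{bpw:.2f} bits/weight on average")
```

The exact values depend on GB vs. GiB, but they show the averages stay close to each file's low-bit floor rather than anywhere near 4-6 bit.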

By my logic it is therefore completely unclear where in the large range (1.58/1.73/2.22/2.51 up to 4/6 bit) the dynamic quants stand in terms of performance. Judging by the bit counts alone doesn't seem possible, because they are fluid.
It would therefore be great to know approximately which dynamic quant corresponds to which usual quant in terms of performance. For me, as a first step it would even be enough to know Q2_K_XL vs. IQ4_XS (similar performance, worse, better?). Hopefully someone with suitable hardware will run performance benchmarks so that the four unsloth dynamic quants can be roughly placed in context with the usual quants.

I have an ongoing project to compute perplexity for the 4 low-bit dynamic quant versions and compare them to Q8 or FP8.
It'll take time and money, since I don't own hardware that can do that and will have to rent a large cloud instance.
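For reference, perplexity is just the exponential of the average negative log-likelihood per token, computed over a text corpus (llama.cpp ships a perplexity tool for exactly this). A minimal sketch of the formula:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the average negative log-likelihood per token (lower is better)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy example with made-up log-probabilities (natural log):
print(perplexity([-1.2, -0.7, -2.3, -0.9]))  # ~3.58
```

A quant that tracks the Q8/FP8 baseline closely should show only a small relative perplexity increase.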

I've read that the dynamic Q2s run significantly faster than the dynamic Q1s on the same machine.

It's a bit of a moot point, and I won't be testing it by downloading a Q1, because I can run Q2_K_XL, so why sacrifice quality?

By "performance" i mean how well the LLM gives correct/intelligent answers. Sorry maybe i should have worded it more clearly.
So the question was about the "intelligence" of the LLMs (like Arena Score in chatbot-arena-leaderboard) and not the speed of execution.
I am not a native speaker, sorry maybe "performance" was misleading, is there technically unmistakably better word for that??
(Maybe ist "perplexity" or "accuracy" better?, i could modify the headline)

By "performance" i mean how well the LLM gives correct/intelligent answers. Sorry maybe i should have worded it more clearly.
So the question was about the "intelligence" of the LLMs (like Arena Score in chatbot-arena-leaderboard) and not the speed of execution.
I am not a native speaker, sorry maybe "performance" was misleading, is there technically unmistakably better word for that??
(Maybe ist "perplexity" or "accuracy" better?, i could modify the headline)

Oh right I understand now! Yes, this is definitely important to work out.

> I have an ongoing project to compute perplexity for the 4 low-bit dynamic quant versions and compare them to Q8 or FP8.
> It'll take time and money, since I don't own hardware that can do that and will have to rent a large cloud instance.

I'm not so familiar with this: is the perplexity metric suitable for this type of model, with its MoE architecture and reasoning/CoT? On the other hand, getting the deviation from Q8/FP8 is very valuable either way. It becomes particularly interesting when the deviation of the dynamic quants is then compared with the deviation of the usual quants (IQ4_XS, IQ3_M, ...), so one can roughly estimate how much “intelligence” remains and where they sit among the usual quants.
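One way to do that comparison once measurements exist: rank all the quants by their relative perplexity increase over the Q8/FP8 baseline. A sketch (every number below is an invented placeholder, not a measurement):

```python
# Placeholder values only; replace with real measured perplexities.
baseline_ppl = 3.50  # hypothetical Q8/FP8 baseline
measured = {         # hypothetical quant results
    "Q2_K_XL (dynamic)": 3.80,
    "IQ4_XS": 3.65,
    "IQ2_M": 4.20,
}

for name, ppl in sorted(measured.items(), key=lambda kv: kv[1]):
    delta = (ppl / baseline_ppl - 1.0) * 100.0
    print(f"{name}: +{delta:.1f}% PPL vs baseline")
```

Whichever quant shows the smaller relative increase would sit higher in the "intelligence" ranking asked about above.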
I found this: https://www.reddit.com/r/LocalLLaMA/comments/1idi5cr/i_did_a_very_short_perplexity_test_with_deepseek/ (But I can't interpret it).
