Potential issue with imatrix
So I did some eval on a few small quants that I made with modified recipes:
- IQ2_K for Routed Experts layers, IQ4_K for the rest, no imatrix
  ```
  # 93 Repeating Layers [0-92]
  # Attention
  blk\..*\.attn_q.*=iq4_k
  blk\..*\.attn_k.*=iq4_k
  blk\..*\.attn_v.*=iq4_k
  blk\..*\.attn_output.*=iq4_k
  # First 3 Dense Layers [0-2]
  blk\..*\.ffn_down\.weight=iq4_k
  blk\..*\.ffn_(gate|up)\.weight=iq4_k
  # Shared Expert Layers [3-92]
  blk\..*\.ffn_down_shexp\.weight=iq4_k
  blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_k
  # Routed Experts Layers [3-92]
  blk\..*\.ffn_down_exps\.weight=iq2_k
  blk\..*\.ffn_(gate|up)_exps\.weight=iq2_k
  # NextN MTP Layer [92]
  blk\..*\.nextn\.embed_tokens\.weight=iq4_k
  blk\..*\.nextn\.shared_head_head\.weight=iq4_k
  blk\..*\.nextn\.eh_proj\.weight=iq4_k
  # Non-Repeating Layers
  token_embd\.weight=iq4_k
  output\.weight=iq4_k
  ```
- IQ2_K for Routed Experts layers, IQ4_K for the rest, with imatrix from 8eadb1a
  (recipe identical to the one above)
- IQ2_KL for Routed Experts layers, IQ4_KSS for the rest, no imatrix
  ```
  # 93 Repeating Layers [0-92]
  # Attention
  blk\..*\.attn_q.*=iq4_kss
  blk\..*\.attn_k.*=iq4_kss
  blk\..*\.attn_v.*=iq4_kss
  blk\..*\.attn_output.*=iq4_kss
  # First 3 Dense Layers [0-2]
  blk\..*\.ffn_down\.weight=iq4_kss
  blk\..*\.ffn_(gate|up)\.weight=iq4_kss
  # Shared Expert Layers [3-92]
  blk\..*\.ffn_down_shexp\.weight=iq4_kss
  blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_kss
  # Routed Experts Layers [3-92]
  blk\..*\.ffn_down_exps\.weight=iq2_kl
  blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
  # NextN MTP Layer [92]
  blk\..*\.nextn\.embed_tokens\.weight=iq4_kss
  blk\..*\.nextn\.shared_head_head\.weight=iq4_kss
  blk\..*\.nextn\.eh_proj\.weight=iq4_kss
  # Non-Repeating Layers
  token_embd\.weight=iq4_kss
  output\.weight=iq4_kss
  ```
- IQ2_KL for Routed Experts layers, IQ4_KSS for the rest, with imatrix from f5d4711
  (recipe identical to the one above)
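For reference, a recipe like the ones above can be turned into a single argument for ik_llama.cpp's `llama-quantize` by dropping the comment lines and joining the rest with commas. Here is a minimal sketch of that; the `--custom-q` flag, the recipe file name, and the model paths are my assumptions rather than the exact commands used for the quants above:

```python
import subprocess
from pathlib import Path

# Join the non-comment, non-empty lines of a recipe file into the
# comma-separated form expected by the (assumed) --custom-q option.
def load_recipe(path: str) -> str:
    rules = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rules.append(line)
    return ",".join(rules)

custom_q = load_recipe("recipe-iq2_k.txt")   # hypothetical recipe file name
cmd = [
    "./build/bin/llama-quantize",
    "--imatrix", "imatrix.dat",              # omit for the no-imatrix quants
    "--custom-q", custom_q,
    "GLM-4.5-BF16.gguf",                     # hypothetical input path
    "GLM-4.5-IQ2_K.gguf",                    # hypothetical output path
    "IQ2_K",                                 # fallback type for unmatched tensors
]
subprocess.run(cmd, check=True)
```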
The eval does code refactoring on a bunch of files, run with top_k=1, and the output is compared against the "golden" files (a little bit like https://www.youtube.com/watch?v=8hQG7QlcLBk). The total input is about 222k tokens, and the total output is about 32k tokens.
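The comparison step can be as simple as a per-file diff against the golden copies. A minimal sketch of that kind of harness, where the directory names and diff granularity are my assumptions rather than the author's actual eval:

```python
import difflib
from pathlib import Path

# Compare each refactored file produced by the model against its "golden"
# counterpart and print a unified diff for any mismatch.
def check_against_golden(output_dir: str, golden_dir: str) -> int:
    failures = 0
    for golden in sorted(Path(golden_dir).glob("*")):
        produced = Path(output_dir) / golden.name
        got = produced.read_text().splitlines(keepends=True)
        want = golden.read_text().splitlines(keepends=True)
        if got != want:
            failures += 1
            diff = difflib.unified_diff(
                want, got,
                fromfile=f"golden/{golden.name}",
                tofile=f"output/{golden.name}",
            )
            print("".join(diff))
    return failures

if __name__ == "__main__":
    print(f"{check_against_golden('output', 'golden')} file(s) differ")
```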
The 1st quant made some mistakes on a couple of files:
- changed `logger.Info` and `logger.Error` to `logger.info` and `logger.error` (the ability to use proper case seems to be the 1st thing that goes when a model is quantized too much, or when `--smart-expert-reduction` is set too aggressively, or when one becomes the CEO of OpenAI)
- changed `)))` to `))))`
The 2nd quant passed the test perfectly. This shows that the imatrix really helps.
The 3rd quant also passed the test perfectly. For this specific test, it looks like going from IQ2_K (2.375 bpw) to IQ2_KL (2.6875 bpw) is sufficient for GLM-4.5 to perform well, without requiring an imatrix.
Now the surprise is the 4th quant. With the imatrix, it somehow made a single mistake. There is a function call with the word "Cards" in it, let's call it `abc.XxxYyyCardsZzz`. The 4th quant wrote it as `abc.XxxYyyCardZzz`, i.e. with a missing "s".
The logprobs for this particular token:
| model | 1st choice | 2nd choice |
|---|---|---|
| fireworks (fp16?) | Cards (99.999%) | Cars |
| 1st quant | Cards (99.997%) | Cars |
| 2nd quant | Cards (99.997%) | Cars |
| 3rd quant | Cards (99.999%) | Card |
| 4th quant | Card (68.228%) | Cards (31.771%) |
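For anyone wanting to reproduce this kind of logprob comparison against a local quant: the llama.cpp-style server can return per-token candidate probabilities via `n_probs` on the `/completion` endpoint. The endpoint, field names, and response key below are from my recollection of that API, so treat them as assumptions:

```python
import json
import urllib.request

# Ask a locally running llama-server for the most likely next tokens,
# using greedy sampling (top_k=1) as in the eval above.
def top_candidates(prompt: str, url: str = "http://127.0.0.1:8080/completion"):
    payload = {
        "prompt": prompt,
        "n_predict": 1,
        "top_k": 1,
        "n_probs": 5,   # return the 5 most likely tokens at each position
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Hypothetical prefix ending right before the disputed "Cards" token.
result = top_candidates("...prefix of the refactored file up to abc.XxxYyy")
print(json.dumps(result.get("completion_probabilities", []), indent=2))
```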
I see that `ubergarm-imatrix-calibration-corpus-v02.txt` does contain a lot more "card" / "cards" / "Card" / "Cards" compared to `calibration_data_v5_rc.txt`, which was previously used for the other models. However, this somehow only impacts the 4th quant, not the 2nd quant. Perhaps IQ4_KSS is too lossy for the attention layers? Will juice them up and do more testing...
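The corpus comparison above is easy to sanity-check with a quick case-sensitive word count (the file names are the ones mentioned above; the script itself is just an illustration):

```python
import re
from pathlib import Path

# Case-sensitive, whole-word counts of each variant in both calibration corpora.
VARIANTS = ["card", "cards", "Card", "Cards"]

for corpus in ["ubergarm-imatrix-calibration-corpus-v02.txt",
               "calibration_data_v5_rc.txt"]:
    text = Path(corpus).read_text(errors="ignore")
    counts = {v: len(re.findall(rf"\b{v}\b", text)) for v in VARIANTS}
    print(corpus, counts)
```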
Interesting that you're digging in deep. Glad you're able to use the recipes as a base to experiment further. I believe I used `ubergarm-imatrix-calibration-corpus-v02.txt` for both of the GLM-4.5 imatrices created in this repo.
The original one was experimental, with the NextN tensors ripped out and with imatrix data included for the final attn/ffn layer, since that layer was used in the inference implementation at the time. The new one, which is the one used for all released models, is missing that final-layer data because the final layer is now completely skipped during inference (which seems to be the correct implementation, as doing it that way resulted in slightly lower perplexity in the Q8_0 test).
> Perhaps IQ4_KSS is too lossy for the attention layers? Will juice them up and do more testing.
In some testing with DeepSeek-R1-0528 I made about 10 test quants and graphed perplexity, varying only the first N dense ffn layers, shexp, and attn.* tensors (which are the ones typically offloaded onto the GPU, and which I keep slightly juiced above the routed exps). In that experiment `iq5_ks` was the sweet spot for that model.
I often make a few test quants to try to suss out how far attn and those layers can be quantized; it's a trade-off between minimizing perplexity and keeping the tensor sizes low for faster token generation speeds.
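That kind of sweep is straightforward to script: build one quant per candidate type for the attn/shexp/first-dense tensors (routed exps held fixed) and run `llama-perplexity` on each. A rough sketch only, with the caveat that the `--custom-q` flag, the simplified regex rules, and the paths are assumptions, not the exact commands from that experiment:

```python
import subprocess

# Sweep the quant type used for attn/shexp/dense ffn tensors while keeping
# the routed experts fixed, then measure perplexity for each variant.
CANDIDATES = ["iq4_kss", "iq4_k", "iq5_ks", "q6_K"]
FIXED_EXPS = "iq2_kl"

for qtype in CANDIDATES:
    custom_q = ",".join([
        rf"blk\..*\.attn_.*={qtype}",
        rf"blk\..*\.ffn_.*_shexp\.weight={qtype}",
        rf"blk\..*\.ffn_(gate|up|down)\.weight={qtype}",
        rf"blk\..*\.ffn_(gate|up|down)_exps\.weight={FIXED_EXPS}",
    ])
    out = f"model-{qtype}.gguf"
    subprocess.run(["./build/bin/llama-quantize", "--imatrix", "imatrix.dat",
                    "--custom-q", custom_q, "model-BF16.gguf", out,
                    qtype.upper()], check=True)
    subprocess.run(["./build/bin/llama-perplexity", "-m", out,
                    "-f", "wiki.test.raw"], check=True)
```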
Anyway, keep us posted with your findings. Yeah, my intuition too is that IQ2_KL at 2.69 BPW is about as small as I'd like to make `ffn_(gate|up)_exps`, and quality begins to fall off pretty quickly below that with the GLM models. It feels like DeepSeek can go a little bit lower, but I don't have charts from those models that are good enough to show it clearly.