Potential issue with imatrix
So I did some eval on a few small quants that I made with modified recipes:
- IQ2_K for Routed Experts layers, IQ4_K for the rest, no imatrix
  ```
  # 93 Repeating Layers [0-92]
  # Attention
  blk\..*\.attn_q.*=iq4_k
  blk\..*\.attn_k.*=iq4_k
  blk\..*\.attn_v.*=iq4_k
  blk\..*\.attn_output.*=iq4_k
  # First 3 Dense Layers [0-2]
  blk\..*\.ffn_down\.weight=iq4_k
  blk\..*\.ffn_(gate|up)\.weight=iq4_k
  # Shared Expert Layers [3-92]
  blk\..*\.ffn_down_shexp\.weight=iq4_k
  blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_k
  # Routed Experts Layers [3-92]
  blk\..*\.ffn_down_exps\.weight=iq2_k
  blk\..*\.ffn_(gate|up)_exps\.weight=iq2_k
  # NextN MTP Layer [92]
  blk\..*\.nextn\.embed_tokens\.weight=iq4_k
  blk\..*\.nextn\.shared_head_head\.weight=iq4_k
  blk\..*\.nextn\.eh_proj\.weight=iq4_k
  # Non-Repeating Layers
  token_embd\.weight=iq4_k
  output\.weight=iq4_k
  ```
- IQ2_K for Routed Experts layers, IQ4_K for the rest, with imatrix from 8eadb1a
  (recipe identical to the one above)
- IQ2_KL for Routed Experts layers, IQ4_KSS for the rest, no imatrix
  ```
  # 93 Repeating Layers [0-92]
  # Attention
  blk\..*\.attn_q.*=iq4_kss
  blk\..*\.attn_k.*=iq4_kss
  blk\..*\.attn_v.*=iq4_kss
  blk\..*\.attn_output.*=iq4_kss
  # First 3 Dense Layers [0-2]
  blk\..*\.ffn_down\.weight=iq4_kss
  blk\..*\.ffn_(gate|up)\.weight=iq4_kss
  # Shared Expert Layers [3-92]
  blk\..*\.ffn_down_shexp\.weight=iq4_kss
  blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_kss
  # Routed Experts Layers [3-92]
  blk\..*\.ffn_down_exps\.weight=iq2_kl
  blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
  # NextN MTP Layer [92]
  blk\..*\.nextn\.embed_tokens\.weight=iq4_kss
  blk\..*\.nextn\.shared_head_head\.weight=iq4_kss
  blk\..*\.nextn\.eh_proj\.weight=iq4_kss
  # Non-Repeating Layers
  token_embd\.weight=iq4_kss
  output\.weight=iq4_kss
  ```
- IQ2_KL for Routed Experts layers, IQ4_KSS for the rest, with imatrix from f5d4711
  (recipe identical to the one above)
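For reference, a recipe like the ones above can be turned into a single argument for ik_llama.cpp's `llama-quantize` by dropping the comment lines and joining the rest with commas. Here is a minimal sketch of that; the `--custom-q` flag, the recipe file name, and the model paths are my assumptions rather than the exact commands used for the quants above:

```python
import subprocess
from pathlib import Path

# Join the non-comment, non-empty lines of a recipe file into the
# comma-separated form expected by the (assumed) --custom-q option.
def load_recipe(path: str) -> str:
    rules = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rules.append(line)
    return ",".join(rules)

custom_q = load_recipe("recipe-iq2_k.txt")   # hypothetical recipe file name
cmd = [
    "./build/bin/llama-quantize",
    "--imatrix", "imatrix.dat",              # omit for the no-imatrix quants
    "--custom-q", custom_q,
    "GLM-4.5-BF16.gguf",                     # hypothetical input path
    "GLM-4.5-IQ2_K.gguf",                    # hypothetical output path
    "IQ2_K",                                 # fallback type for unmatched tensors
]
subprocess.run(cmd, check=True)
```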
The eval does code refactoring on a bunch of files, run with top_k=1, and the output is compared against the "golden" files (a little bit like https://www.youtube.com/watch?v=8hQG7QlcLBk). The total input is about 222k tokens, and the total output is about 32k tokens.
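The comparison step can be as simple as a per-file diff against the golden copies. A minimal sketch of that kind of harness, where the directory names and diff granularity are my assumptions rather than the author's actual eval:

```python
import difflib
from pathlib import Path

# Compare each refactored file produced by the model against its "golden"
# counterpart and print a unified diff for any mismatch.
def check_against_golden(output_dir: str, golden_dir: str) -> int:
    failures = 0
    for golden in sorted(Path(golden_dir).glob("*")):
        produced = Path(output_dir) / golden.name
        got = produced.read_text().splitlines(keepends=True)
        want = golden.read_text().splitlines(keepends=True)
        if got != want:
            failures += 1
            diff = difflib.unified_diff(
                want, got,
                fromfile=f"golden/{golden.name}",
                tofile=f"output/{golden.name}",
            )
            print("".join(diff))
    return failures

if __name__ == "__main__":
    print(f"{check_against_golden('output', 'golden')} file(s) differ")
```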
The 1st quant made some mistakes on a couple of files:
- changed `logger.Info` and `logger.Error` to `logger.info` and `logger.error` (the ability to use proper case seems to be the 1st thing that goes when a model is quantized too much, or when `--smart-expert-reduction` is set too aggressively, or when one becomes the CEO of OpenAI)
- changed `)))` to `))))`
The 2nd quant passed the test perfectly. This shows that the imatrix really helps.
The 3rd quant also passed the test perfectly. For this specific test, it looks like going from IQ2_K (2.375 bpw) to IQ2_KL (2.6875 bpw) is sufficient for GLM-4.5 to perform well, without requiring an imatrix.
Now the surprise is the 4th quant. With the imatrix, it somehow made a single mistake. There is a function call with the word "Cards" in it, let's call it `abc.XxxYyyCardsZzz`. The 4th quant wrote it as `abc.XxxYyyCardZzz`, i.e. with a missing "s".
The logprobs for this particular token:
| model | 1st choice | 2nd choice |
|---|---|---|
| fireworks (fp16?) | Cards (99.999%) | Cars |
| 1st quant | Cards (99.997%) | Cars |
| 2nd quant | Cards (99.997%) | Cars |
| 3rd quant | Cards (99.999%) | Card |
| 4th quant | Card (68.228%) | Cards (31.771%) |
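For anyone wanting to reproduce this kind of logprob comparison against a local quant: the llama.cpp-style server can return per-token candidate probabilities via `n_probs` on the `/completion` endpoint. The endpoint, field names, and response key below are from my recollection of that API, so treat them as assumptions:

```python
import json
import urllib.request

# Ask a locally running llama-server for the most likely next tokens,
# using greedy sampling (top_k=1) as in the eval above.
def top_candidates(prompt: str, url: str = "http://127.0.0.1:8080/completion"):
    payload = {
        "prompt": prompt,
        "n_predict": 1,
        "top_k": 1,
        "n_probs": 5,   # return the 5 most likely tokens at each position
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Hypothetical prefix ending right before the disputed "Cards" token.
result = top_candidates("...prefix of the refactored file up to abc.XxxYyy")
print(json.dumps(result.get("completion_probabilities", []), indent=2))
```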
I see that `ubergarm-imatrix-calibration-corpus-v02.txt` does contain a lot more "card" / "cards" / "Card" / "Cards" compared to `calibration_data_v5_rc.txt`, which was previously used for the other models. However, this somehow only impacts the 4th quant, not the 2nd quant. Perhaps IQ4_KSS is too lossy for the attention layers? Will juice them up and do more testing...
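The corpus comparison above is easy to sanity-check with a quick case-sensitive word count (the file names are the ones mentioned above; the script itself is just an illustration):

```python
import re
from pathlib import Path

# Case-sensitive, whole-word counts of each variant in both calibration corpora.
VARIANTS = ["card", "cards", "Card", "Cards"]

for corpus in ["ubergarm-imatrix-calibration-corpus-v02.txt",
               "calibration_data_v5_rc.txt"]:
    text = Path(corpus).read_text(errors="ignore")
    counts = {v: len(re.findall(rf"\b{v}\b", text)) for v in VARIANTS}
    print(corpus, counts)
```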
Interesting that you're digging in deep. Glad you're able to use the recipes as a base to experiment further. I believe I used `ubergarm-imatrix-calibration-corpus-v02.txt` for both of the GLM-4.5 imatrices created in this repo.
The original one was experimental, with the NextN tensors ripped out and with imatrix data included for the final attn/ffn layer, since that layer was used in the inference implementation at the time. The new one, which is the one used for all released models, is missing that final-layer data because the final layer is now completely skipped during inference (which seems to be the correct implementation, as doing it that way resulted in slightly lower perplexity in the Q8_0 test).
> Perhaps IQ4_KSS is too lossy for the attention layers? Will juice them up and do more testing.
In some testing with DeepSeek-R1-0528 I made about 10 test quants and graphed perplexity, varying only the first N dense ffn layers, shexp, and attn.* tensors (which are the ones typically offloaded onto the GPU, and which I keep slightly juiced above the routed exps). In that experiment `iq5_ks` was the sweet spot for that model.
I often make a few test quants to try to suss out how far attn and those layers can be quantized; it's a trade-off between minimizing perplexity and keeping the tensor sizes low for faster token generation speeds.
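That kind of sweep is straightforward to script: build one quant per candidate type for the attn/shexp/first-dense tensors (routed exps held fixed) and run `llama-perplexity` on each. A rough sketch only, with the caveat that the `--custom-q` flag, the simplified regex rules, and the paths are assumptions, not the exact commands from that experiment:

```python
import subprocess

# Sweep the quant type used for attn/shexp/dense ffn tensors while keeping
# the routed experts fixed, then measure perplexity for each variant.
CANDIDATES = ["iq4_kss", "iq4_k", "iq5_ks", "q6_K"]
FIXED_EXPS = "iq2_kl"

for qtype in CANDIDATES:
    custom_q = ",".join([
        rf"blk\..*\.attn_.*={qtype}",
        rf"blk\..*\.ffn_.*_shexp\.weight={qtype}",
        rf"blk\..*\.ffn_(gate|up|down)\.weight={qtype}",
        rf"blk\..*\.ffn_(gate|up|down)_exps\.weight={FIXED_EXPS}",
    ])
    out = f"model-{qtype}.gguf"
    subprocess.run(["./build/bin/llama-quantize", "--imatrix", "imatrix.dat",
                    "--custom-q", custom_q, "model-BF16.gguf", out,
                    qtype.upper()], check=True)
    subprocess.run(["./build/bin/llama-perplexity", "-m", out,
                    "-f", "wiki.test.raw"], check=True)
```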
Anyway, keep us posted with your findings. Yeah, my intuition too is that IQ2_KL at 2.69 BPW is about as small as I'd like to make `ffn_(gate|up)_exps`, and quality begins to fall off pretty quickly below that with the GLM models. It feels like DeepSeek can go a little bit lower, but I don't have charts from those models that are good enough to show it clearly.