Any chance for an IQ3_K?
The IQ4_KSS is good - and does fit on 2x6000 Blackwells, but gotta keep context relatively small. Any chance for an IQ3 that could hit that space between IQ4_KSS and IQ2_KL, before ppl starts going off the rails? Maybe at something like ~160GB?
Thanks for all you do!
Yeah maybe so, I just did an IQ3_KS for Air and might be able to use the basic recipe here given they are similar. Will see what I can do!
Also if you're on 2x6k blackwells you can use the KT quants which are from the QTIP paper and similar to exllamav3's EXL3 quants. I know turboderp just got Air going: https://huggingface.co/turboderp/GLM-4.5-Air-exl3 but not sure if full size is available or in the works.
thank you so much!
Also, looks like Thireus has a ton of options. I just grabbed his IQ4_KT special sauce quant, which weighs in at 168GB, and I'm running 90k context - pretty sweet! I'll do some ppl measurements on it as well just to see where it's at.
Oh nice that sounds like a good size! I'm uploading an IQ3_KT right now: 147.565 GiB (3.537 BPW), Final estimate: PPL = 3.4369 +/- 0.01975. I used iq4_kss on the ffn_down_exps instead of iq4_kt actually as some of my previous testing suggested they are similar (both exactly 4.0bpw) and the iq4_kss would have faster TG if anyone had to run it on CPU.
Would love to see any numbers you get! I have my perplexity workflow here: https://huggingface.co/ubergarm/GLM-4.5-GGUF/discussions/4#6896071d1bc2e44f792ce8f8 (mine tend to be just a tiny bit high running on CPU-only backend on this rig I've noticed comparing with some CUDA folks, not sure what that is about).
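Roughly, the measurement boils down to something like this (just a sketch with placeholder model path and test file; the exact flags and corpus I use are in the linked discussion):

```bash
# Sketch of a typical perplexity run (placeholder model path and test file;
# see the linked discussion for the exact workflow I use).
./build/bin/llama-perplexity \
    -m /models/GLM-4.5-GGUF/IQ3_KT/GLM-4.5-IQ3_KT-00001-of-00004.gguf \
    -f wiki.test.raw \
    -c 512 \
    -fa \
    --threads 16
```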
Have fun playing with all the quants!
For the IQ4_KT_Special quant from Thireus:
Final estimate: PPL = 3.3351 +/- 0.01906. Makes sense considering your IQ4_KSS.
How do you calculate the total bpw?
For the IQ4_KT_Special quant from Thireus:
Oh nice, yes just a bit higher than my IQ4_KSS. I too am curious what size that is exactly.
How do you calculate the total bpw?
I just look at the logs; when starting llama-server or running llama-perplexity it will show something like:
llm_load_print_meta: model type = 355B.A32B
llm_load_print_meta: model ftype = IQ3_KT - 3.125 bpw
llm_load_print_meta: model params = 358.338 B
llm_load_print_meta: model size = 147.565 GiB (3.537 BPW) # <--- I copy paste this line for total size/BPW
llm_load_print_meta: repeating layers = 146.560 GiB (3.529 BPW, 356.786 B parameters)
llm_load_print_meta: general.name = GLM 4.5
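If you want to double-check that BPW number by hand, it's just the total model size in bits divided by the parameter count, e.g. a quick back-of-the-envelope calc using the numbers above:

```bash
# BPW = model size in bits / parameter count
# 147.565 GiB -> bytes -> bits, divided by 358.338 B parameters
echo "scale=3; 147.565 * 1024^3 * 8 / 358338000000" | bc
# prints ~3.537, matching the BPW in the log line above
```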
Here's some interesting results:
$ docker run --rm --gpus all -v models:/models ik_llama:latest /usr/local/bin/llama-sweep-bench -m /models/GLM-4.5-GGUF/IQ3_KT/GLM-4.5-IQ3_KT-00001-of-00004.gguf -c 32764 -ngl 999 --no-mmap --threads 16 -b 4096 -ub 4096 -fa
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
4096 | 1024 | 0 | 3.896 | 1051.35 | 39.357 | 26.02 |
4096 | 1024 | 4096 | 4.326 | 946.89 | 43.548 | 23.51 |
4096 | 1024 | 8192 | 4.858 | 843.10 | 48.862 | 20.96 |
4096 | 1024 | 12288 | 5.381 | 761.18 | 53.799 | 19.03 |
4096 | 1024 | 16384 | 5.909 | 693.21 | 58.304 | 17.56 |
4096 | 1024 | 20480 | 6.460 | 634.02 | 63.597 | 16.10 |
4096 | 1024 | 24576 | 7.213 | 567.88 | 68.713 | 14.90 |
4096 | 1024 | 28672 | 8.192 | 499.99 | 73.088 | 14.01 |
$ docker run --rm --gpus all -v models:/models ik_llama:latest /usr/local/bin/llama-sweep-bench -m /models/GLM-4.5-GGUF/IQ4_KT-Special/GLM-4.5-THIREUS-IQ4_KT-SPECIAL_TENSOR-00001-of-01762.gguf -c 32764 -ngl 999 --no-mmap --threads 16 -b 4096 -ub 4096 -fa
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
4096 | 1024 | 0 | 3.918 | 1045.42 | 36.259 | 28.24 |
4096 | 1024 | 4096 | 4.400 | 931.00 | 40.682 | 25.17 |
4096 | 1024 | 8192 | 4.934 | 830.24 | 45.855 | 22.33 |
4096 | 1024 | 12288 | 5.451 | 751.40 | 50.933 | 20.10 |
4096 | 1024 | 16384 | 5.985 | 684.41 | 55.334 | 18.51 |
4096 | 1024 | 20480 | 6.552 | 625.12 | 60.364 | 16.96 |
4096 | 1024 | 24576 | 7.328 | 558.94 | 65.469 | 15.64 |
4096 | 1024 | 28672 | 8.411 | 486.97 | 69.932 | 14.64 |
Definitely interesting results. Not sure yet what to make of it.
Oh great thanks for making some llama-sweep-bench charts between mine and Thireus' quants. A couple thoughts/questions:
- Can you tell me what the size of that Thireus-IQ4_KT quant is e.g. this line in the startup logs:
llm_load_print_meta: model size = 147.565 GiB (3.537 BPW) # <--- I copy paste this line for total size/BPW
- You can add `--warmup-batch` on ik's fork (no need on my mainline branch of the llama-sweep-bench port though, as it is hardcoded enabled there). It shouldn't affect much, but without it the first point can be lower; then again you're fully offloading, so it probably doesn't matter much.
- Since you're fully offloaded, setting threads to exactly 1, e.g. `--threads 1` or `-t 1`, can sometimes give a few more percent boost. Combined, that would look something like the sketch below.
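Putting those together with your command from above, a sketch (same paths and flags you already used, just with the warmup and single-thread tweaks added):

```bash
# Same sweep as before, but with warmup enabled and a single thread since the
# model is fully offloaded to GPU (sketch; adjust paths to your setup).
docker run --rm --gpus all -v models:/models ik_llama:latest \
  /usr/local/bin/llama-sweep-bench \
    -m /models/GLM-4.5-GGUF/IQ3_KT/GLM-4.5-IQ3_KT-00001-of-00004.gguf \
    -c 32764 -ngl 999 --no-mmap \
    -b 4096 -ub 4096 -fa \
    --warmup-batch \
    --threads 1
```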
Yeah, each quantization type has a different kernel implementation depending on the backend (CUDA / Vulkan / CPU AVX2 / CPU AVX_VNNI / CPU NEON etc.), so different mixes can perform differently.
I used iq4_kss on the ffn_down_exps instead of iq4_kt actually as some of my previous testing suggested they are similar (both exactly 4.0bpw) and the iq4_kss would have faster TG if anyone had to run it on CPU.
Do you plan releasing IQ3_KL for us CPU-bound folks, or this IQ3_KT shouldn't be any slower than IQ3_KL?
Do you plan releasing IQ3_KL for us CPU-bound folks, or this IQ3_KT shouldn't be any slower than IQ3_KL?
The recent addition of iq2_kl has been useful, but to be honest I've never tried the `IQ3_KL : 4 bpw non-linear quantization mix`, whose listed description oddly suggests it is the same size as both the iq4_kt and also the new iq4_kss. So an IQ3_KL would technically be about the same size as the existing IQ4_KSS, probably.
The more TG CPU-friendly version would likely be using iq3_ks or iq3_k.
I'm guessing it would be slower only for TG, but given only ffn_(up|gate)_exps are trellis quants it would be interesting to see how well it keeps up with the iq3_ks etc.
What target RAM+VRAM are you looking for, and perhaps I'll release one more in this range using non-trellis quants. Also feel free to give it a try and report back how it performs on your system and include ram/vram/cpu/os info too. Thanks!
What target RAM+VRAM are you looking for
160 RAM + 12 VRAM. IQ3_K should be the best, leaving a bit for some limited context.
I don't think my tests would be representative, since I have such unusual RAM amount (2x48 + 2x32 DDR5 running at 62GB/s).
160 RAM + 12 VRAM. IQ3_K should be the best, leaving a bit for some limited context.
I don't think my tests would be representative, since I have such unusual RAM amount (2x48 + 2x32 DDR5 running at 62GB/s).
Oh, interesting combination, yes. I haven't measured how much the attn/shexp/first N ffn dense layers take up on VRAM here when offloading with the usual `-ngl 99 -ot exps=CPU`, e.g. how much room you have left over for kv-cache.
I did some llama-sweep-benches on an all CPU configuration and interestingly the IQ3_KT is not suffering too much on TG. Granted this is a huge AMD EPYC with a ton of cores, but throwing more cores at it actually slowed down TG (which is typical of the non-KT quants).
Also, I was able to get back some of the performance using the experimental ik_llama.cpp branch `ik/q8_k_r8_avx512`, which supports the Zen5 avx_vnni CPU flag. So definitely try that if you have a Zen5 chip on AM5 like the AMD 9950X (my personal home gaming rig uses this and sees a benefit mostly in PP uplift).
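Trying that branch is the usual checkout-and-rebuild, roughly like this (a sketch assuming the standard cmake workflow; add your normal backend options, e.g. CUDA, on top):

```bash
# Rough sketch: check out the experimental branch and rebuild.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
git checkout ik/q8_k_r8_avx512
cmake -B build
cmake --build build --config Release -j "$(nproc)"
```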
Finally, you could probably get a little more TG uplift experimenting with a draft model e.g. https://huggingface.co/jukofyork/GLM-4.5-DRAFT-0.6B-v3.0-GGUF running with something like:
-md DRAFT-0.6B-Q4_0.gguf \
-ngld 99 \
--draft 32 \
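For example, tacked onto a typical server invocation it might look something like this (a sketch only; the filenames and the offload split are placeholders based on the setup discussed above):

```bash
# Sketch: main model with routed experts on CPU/RAM, plus the small draft
# model fully on GPU for speculative decoding (placeholder paths).
./build/bin/llama-server \
    -m GLM-4.5-IQ3_KT-00001-of-00004.gguf \
    -ngl 99 -ot exps=CPU \
    -fa -fmoe \
    -md DRAFT-0.6B-Q4_0.gguf \
    -ngld 99 \
    --draft 32
```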
So for now I'll not upload a new model in that ~3.5bpw range and curious what you see if you give the existing IQ3_KT a try. Thanks!
hi @ubergarm
Can you share your full args to quantize this model? I want to make an IQ4_XS_R8 version with good quality. (Do I need to use an imatrix or calibration data?)
Based on https://github.com/ikawrakow/ik_llama.cpp/pull/624, are there any adjustments to the custom_q recipe for GLM-4.5 when running on CPU?
I tried this:
#!/usr/bin/env bash
custom="
# 93 Repeating Layers [0-92]
# Attention
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0
# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k
# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq6_k
blk\..*\.nextn\.shared_head_head\.weight=iq6_k
blk\..*\.nextn\.eh_proj\.weight=q8_0
# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 0 -m 0 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ5_K.gguf \
IQ5_K \
192 # threads
Can you share your full args to quantize this model?
I try to provide my full commands in all of the model cards for every quant, let me know if I missed something, but I believe it is all there. You can find additional information in my quant cookers guide here: https://github.com/ikawrakow/ik_llama.cpp/discussions/434
You are welcome to use my imatrix and any data that I have provided to make your own quant! Let me know how it goes and I'd love to see any llama-sweep-bench or llama-perplexity comparisons as well!
I want to make an IQ4_XS_R8 version with good quality.
I've never messed around with `_R8` quants and thought they existed mainly for internal use with activations etc. I no longer release `_R4` / `-rtr` quants either, as going with larger `-ub 4096 -b 4096` tends to favor non-repacked quants for PP speed now. I'd also advise against `iq4_xs`, as it is an older mainline quant; instead you can use the newer `iq4_ks` and `iq4_kss` SOTA quants with similar BPW but likely better perplexity.
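For example, starting from the recipe you posted, the routed-experts lines could become something like this (just a sketch of the idea, not a tuned recipe; measure perplexity yourself as always):

```bash
# Sketch: use the newer iq4_ks / iq4_kss on the routed experts instead of iq4_xs
# (untested mix, shown only to illustrate the --custom-q pattern).
custom="
# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
"
```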
Regarding PR624 which you link: it only affects `Q2_K, Q3_K, Q4_K, Q5_K, IQ2_KS, IQ3_KS, IQ3_K`, none of which appear in the recipe you listed. If you read all of PR624 you'll see that you might have to try with and without that branch compiled in and measure perplexity yourself to see which one is "better", as the tweaks don't seem to be uniformly better across all quants/models but may vary a bit, hence why it is unmerged so far.
What is your goal here? Are you trying to fit the best quality model into a specific RAM+VRAM target size? Anyway, have fun, quantizing is a cool hobby!
Good luck and cheers!
Thanks for the tips.
My target is to run the model on pure CPU with Zen4 AVX512 as fast as possible, with reasonable perplexity.
I read somewhere that non-linear quantization and `_R4`/`_R8` are good for CPU. I will try comparing the normal and `_Rx` versions with `-rtr`.
My target is to run the model on pure CPU with Zen4 AVX512 as fast as possible, with reasonable perplexity.
Zen4 doesn't get much speed-up with avx512 as it still takes multiple CPU cycles. Zen5 gives the real avx_vnni CPU flags which are faster: https://github.com/ikawrakow/ik_llama.cpp/pull/710
I read somewhere that non-linear quantization and `_R4`/`_R8` are good for CPU. I will try comparing the normal and `_Rx` versions with `-rtr`.
Yes, the repacked row-interleaved quants can be good for CPU/RAM inferencing, especially at lower batch sizes. Larger batch sizes, e.g. `-ub 4096 -b 4096`, can improve PP significantly even on MoE though; you will want to run llama-sweep-bench tests to compare results, as shown in the link above where I'm doing some CPU-only benchmarks.
`-rtr` is the same as `_R4` for all quants running on CPU/RAM, except for a case like IQ1_S which is not symmetric with IQ1_S_R4 I think, but the rest are; double check by looking at the closed PRs on ik_llama.cpp though. You can also "offline repack" using `llama-quantize` yourself to prepare an `_r4` version of any of my quants; then you don't need to use `-rtr`, so you can still use `mmap()` if needed for faster startup etc.
Thank you for the advice. Sorry I didn't answer earlier; I thought I'd only write after testing every possible optimization, but my progress kinda stalled, so I might as well just recollect what I tried.
I have Zen4 CPU (7700), so using AVX512 actually decreases performance due to heating.
The answer to
how much room you have left-over for kv-cache
is: just 2k tokens:
llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf -b 512 -ub 256 -c 2816 --no-mmap -t 8 -tb 7 -fa -ctk q8_0 -ctv q8_0 -fmoe -ot "exps=CPU" -ngl 99
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
256 | 64 | 0 | 18.887 | 13.55 | 22.644 | 2.83 |
256 | 64 | 256 | 18.531 | 13.81 | 23.138 | 2.77 |
256 | 64 | 512 | 19.379 | 13.21 | 22.950 | 2.79 |
256 | 64 | 768 | 19.192 | 13.34 | 23.809 | 2.69 |
256 | 64 | 1024 | 18.912 | 13.54 | 24.057 | 2.66 |
256 | 64 | 1280 | 18.736 | 13.66 | 26.112 | 2.45 |
256 | 64 | 1536 | 18.861 | 13.57 | 25.447 | 2.52 |
256 | 64 | 1792 | 19.062 | 13.43 | 23.655 | 2.71 |
256 | 64 | 2048 | 18.884 | 13.56 | 23.371 | 2.74 |
256 | 64 | 2304 | 19.376 | 13.21 | 23.514 | 2.72 |
256 | 64 | 2560 | 19.188 | 13.34 | 23.557 | 2.72 |
Obviously 2k isn't enough for anything, so I have to decrease ngl:
llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf -b 1024 -ub 512 -c 10240 --no-mmap -t 8 -tb 7 -fa -ctk q8_0 -ctv q8_0 -fmoe -ot "exps=CPU" -ngl 83
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 30.990 | 16.52 | 54.778 | 2.34 |
512 | 128 | 512 | 30.682 | 16.69 | 56.867 | 2.25 |
512 | 128 | 1024 | 31.399 | 16.31 | 56.414 | 2.27 |
512 | 128 | 1536 | 30.918 | 16.56 | 56.719 | 2.26 |
512 | 128 | 2048 | 30.664 | 16.70 | 57.066 | 2.24 |
512 | 128 | 2560 | 31.150 | 16.44 | 56.874 | 2.25 |
512 | 128 | 3072 | 31.179 | 16.42 | 56.605 | 2.26 |
512 | 128 | 3584 | 30.571 | 16.75 | 57.030 | 2.24 |
512 | 128 | 4096 | 30.791 | 16.63 | 57.521 | 2.23 |
512 | 128 | 4608 | 31.040 | 16.49 | 58.079 | 2.20 |
512 | 128 | 5120 | 31.328 | 16.34 | 57.964 | 2.21 |
512 | 128 | 5632 | 31.479 | 16.26 | 58.625 | 2.18 |
512 | 128 | 6144 | 31.463 | 16.27 | 58.985 | 2.17 |
512 | 128 | 6656 | 31.315 | 16.35 | 58.621 | 2.18 |
512 | 128 | 7168 | 31.317 | 16.35 | 59.899 | 2.14 |
512 | 128 | 7680 | 31.168 | 16.43 | 60.143 | 2.13 |
512 | 128 | 8192 | 32.308 | 15.85 | 60.282 | 2.12 |
512 | 128 | 8704 | 31.878 | 16.06 | 59.274 | 2.16 |
512 | 128 | 9216 | 30.871 | 16.59 | 60.104 | 2.13 |
10k context is already enough for many simple tasks, and TG is still not too bad for a GPU-poor setup. I'm less concerned about PP with small contexts, so I don't set `-ub 4096`, which would require reducing `ngl` further.
BTW, when I tried `ffn=CPU` with `-ngl 99` instead of `exps=CPU`, thus offloading only the attn tensors of all blocks, the results were worse:
llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf -b 2048 -ub 512 -c 10240 --no-mmap -t 8 -tb 7 -fa -ctk q8_0 -ctv q8_0 -fmoe -ot "ffn=CPU" -ngl 99
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 30.252 | 16.92 | 60.091 | 2.13 |
512 | 128 | 512 | 29.573 | 17.31 | 61.239 | 2.09 |
512 | 128 | 1024 | 29.529 | 17.34 | 60.954 | 2.10 |
512 | 128 | 1536 | 29.696 | 17.24 | 63.248 | 2.02 |
512 | 128 | 2048 | 29.843 | 17.16 | 62.106 | 2.06 |
512 | 128 | 2560 | 29.749 | 17.21 | 62.514 | 2.05 |
512 | 128 | 3072 | 30.754 | 16.65 | 60.991 | 2.10 |
512 | 128 | 3584 | 32.811 | 15.60 | 62.740 | 2.04 |
512 | 128 | 4096 | 30.536 | 16.77 | 62.033 | 2.06 |
512 | 128 | 4608 | 30.473 | 16.80 | 61.400 | 2.08 |
512 | 128 | 5120 | 30.109 | 17.00 | 62.899 | 2.04 |
512 | 128 | 5632 | 30.044 | 17.04 | 64.052 | 2.00 |
512 | 128 | 6144 | 29.956 | 17.09 | 64.646 | 1.98 |
512 | 128 | 6656 | 29.830 | 17.16 | 62.800 | 2.04 |
512 | 128 | 7168 | 30.906 | 16.57 | 63.378 | 2.02 |
512 | 128 | 7680 | 31.361 | 16.33 | 63.318 | 2.02 |
512 | 128 | 8192 | 31.476 | 16.27 | 62.979 | 2.03 |
512 | 128 | 8704 | 30.113 | 17.00 | 62.722 | 2.04 |
512 | 128 | 9216 | 29.939 | 17.10 | 65.556 | 1.95 |
512 | 128 | 9728 | 30.818 | 16.61 | 63.102 | 2.03 |
Now, for 32k context the batch size 4096 makes sense:
llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf -b 4096 -ub 4096 -c 32768 --no-mmap -t 8 -tb 7 -fa -ctk q8_0 -ctv q8_0 -fmoe -ot "exps=CPU" -ngl 39
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
4096 | 1024 | 0 | 28.031 | 146.12 | 514.676 | 1.99 |
4096 | 1024 | 4096 | 28.891 | 141.77 | 566.682 | 1.81 |
4096 | 1024 | 8192 | 30.607 | 133.83 | 613.948 | 1.67 |
4096 | 1024 | 12288 | 32.868 | 124.62 | 671.010 | 1.53 |
4096 | 1024 | 16384 | 37.314 | 109.77 | 762.135 | 1.34 |
4096 | 1024 | 20480 | 38.373 | 106.74 | 859.401 | 1.19 |
4096 | 1024 | 24576 | 42.212 | 97.03 | 935.687 | 1.09 |
I've tried GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf. It slows down TG by ~1.25x when used with llama-server. llama-sweep-bench just ignores the `-md` switch, so no benchmarks here. It's just slow.
I've also tried `--no-kv-offload` with `-ngl 99`; it's much slower, especially after 20k tokens.
Usually when TG is lower than 1 t/s I find it too slow and switch to a lower quant. Fortunately IQ3_KT only slows down to 1.0 after 32k tokens, and coincidentally that's where all models start overlooking older parts of context, so I tend to never use more than 32k anyway.
So, for now I'll be using IQ3_KT. Thank you for the good quant. The only thing that bothers me is 10GB of free RAM that could be used for decreasing perplexity (e.g. store ffn_(gate|up) of Routed Experts in IQ3_K_R4 or IQ3_KM, not sure which is better).
Oh, actually, why the IQ3_KT stores attn_k and attn_v of Routed Experts in Q8_0 while your IQ4_KSS stores them in IQ6_K?
Shouldn't the lower quant use same IQ6_K? (would really help store more of them in my superlimited VRAM)
I was thinking about cooking a balanced "IQ3_K" quant using Thireus' GGUF-Tool-Suite, but since I'm still on Windows, it doesn't want to cooperate without some wrestling.
I have Zen4 CPU (7700), so using AVX512 actually decreases performance due to heating.
Yeah, Zen4 AVX512 instructions take multiple CPU clocks to perform, so no big benefit for PP like on Zen5 unfortunately. Interesting it heats your CPU up.
Obviously 2k isn't enough for anything, so I have to decrease ngl:
So I wouldn't recommend reducing ngl; the strategy for MoE is `-ngl 99` and putting all the routed exps on CPU/RAM. But I understand you have only 12GB VRAM, which is quite low despite a lot of RAM. You could also play with reducing the kv-cache size on VRAM with heavier quantization, e.g. `-ctk q6_0 -ctv q6_0` or `-ctk q4_1 -ctv q4_1` or `-ctk iq4_nl -ctv iq4_nl`, something better than q4_0 but smaller than q8_0. Even so you might not get enough context fully offloading... hrmm...
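Something like this is roughly what I'd try first (a sketch using your model filename from above; with 12GB VRAM you may still come up short on context):

```bash
# Sketch: attn/shexp/dense layers on GPU, routed experts on CPU/RAM,
# with a smaller-than-q8_0 kv-cache quant; push -c up until VRAM runs out.
llama-server -m GLM-4.5-IQ3_KT.gguf \
    -ngl 99 -ot "exps=CPU" \
    -fa -fmoe \
    -ctk q6_0 -ctv q6_0 \
    -c 4096 \
    -t 8
```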
BTW, when I tried ffn=CPU with -ngl 99 instead of exps=CPU, thus offloading only attn of all blocks, the results were worse:
Of course: you were offloading the first N dense layers `ffn_(gate|down|up)` as well as the shared expert `ffn_(gate|down|up)_shexp`, which are always-active weights for every token, onto CPU. The strategy for MoE is to always keep those on GPU/VRAM and only offload the routed experts tensors `ffn_(gate|down|up)_exps` onto CPU/RAM. You can check the model card sidebar on huggingface to see the exact names of tensors for regex matching.
-t 8 -tb 7
You have 8 physical CPU cores, so I'd recommend just using `-t 8` and being done with it. Not sure why you are using fewer threads for threads-batch (prompt processing/prefill)? Typically tb should be higher, but only on big many-core CPUs. For your system 8 and 8 should be best.
I've also tried --no-kv-offload with -ngl 99, it's much slower, especially after 20k tokens.
Yes, I've heard some use this strategy only if they require a ton of kv-cache and are willing to go very, very slowly. Otherwise always keep the kv-cache on GPU/VRAM.
I've tried GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf. It slows down TG by ~1.25x when used with llama-server. llama-sweep-bench just ignores the -md switch so no benchmarks here. It's just slow.
Ahh yeah, I've heard from some that it doesn't have enough valid token matches to be worth it for many types of applications. Thanks for trying. Also, with only 12GB VRAM you barely have enough to run the main model, let alone add another small model into VRAM. Huh, I thought llama-sweep-bench would use it; I have a test with a different model showing about a 1 tok/sec speed-up with llama-sweep-bench. Oh well, probably not worth more exploration on your setup.
The only thing that bothers me is 10GB of free RAM that could be used for decreasing perplexity (e.g. store ffn_(gate|up) of Routed Experts in IQ3_K_R4 or IQ3_KM, not sure which is better).
Be careful trying to mini-max so much and splitting up ffn tensors across different devices. `-fmoe` needs to have (gate|up) on the same device at a minimum to work, for example. Also, depending on PCIe, it could add extra communication overhead, possibly going back and forth for each layer. 10GB of RAM probably isn't going to make a noticeable difference.
Oh, actually, why the IQ3_KT stores attn_k and attn_v of Routed Experts in Q8_0 while your IQ4_KSS stores them in IQ6_K?
Shouldn't the lower quant use same IQ6_K? (would really help store more of them in my superlimited VRAM)
The attn tensors don't belong to the routed experts; each layer can have a mix of many tensor types. I chose the larger q8_0 attn tensors for the KT because sometimes that slightly larger size over iq6_k can give a noticeable boost in perplexity. I generally design assuming a 16GB VRAM minimum, where it wouldn't matter so much; but on your system the small savings of iq6_k over q8_0 would allow you some more context etc. Sorry about that. There are no hard and fast rules about "shouldn't the lower quant..." really. There is a tradition of quantization mix schemes, e.g. `IQ4_K_M` or `IQ4_K_XL`, which have some meaning hard-coded into llama-quantize. The unsloth quants are basically just this with a slightly different mix. I only use custom quantizations and have never limited myself to the traditional mixes, as bartowski, mradermacher, and unsloth do a fine job already with those flavors.
In general keeping attn tensors a bit higher compared to the rest of the mix can give pretty good perplexity boost for the size.
How fast is your NVMe SSD? If it is PCIe Gen 5, e.g. a Crucial T700 drive, you might be better off going with the IQ4_KSS with its smaller attn tensors, letting the model hang out of RAM onto SSD and letting the default read-only mmap() operate off of the page cache. I've run DeepSeek 671B like this at up to 4-5 tok/sec with only 96GB RAM.
The main thing I'd recommend you explore is `-rtr` for run-time repack, which disables mmap() and mallocs the entire model on start, with the tensors running on CPU/RAM repacked into row-interleaved format. This can give a boost to TG as it improves cpu/ram/cache effectiveness. You can also play with `-ub 1024 -b 2048` or other batch sizes smaller than 4096 etc.; the defaults are `-ub 512 -b 2048` fwiw.
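e.g. something along these lines, benchmarked against your current settings (a sketch built from your earlier 10k-context command):

```bash
# Sketch based on your earlier command: same offload split, but with run-time
# repack of the CPU-resident tensors and a larger batch. Compare against your
# current numbers with llama-sweep-bench before settling on it.
llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf \
    -rtr \
    -b 2048 -ub 1024 \
    -c 10240 -t 8 -fa \
    -ctk q8_0 -ctv q8_0 -fmoe \
    -ot "exps=CPU" -ngl 83
```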
Okay keep hacking and maybe you can squeeze another tok/sec out of your system!