Any chance for an IQ3_K?
The IQ4_KSS is good - and does fit on 2x6000 Blackwells, but gotta keep context relatively small. Any chance for an IQ3 that could hit that space between IQ4_KSS and IQ2_KL, before ppl starts going off the rails? Maybe at something like ~160GB?
Thanks for all you do!
Yeah maybe so, I just did an IQ3_KS for Air and might be able to use the basic recipe here given they are similar. Will see what I can do!
Also if you're on 2x6k blackwells you can use the KT quants which are from the QTIP paper and similar to exllamav3's EXL3 quants. I know turboderp just got Air going: https://huggingface.co/turboderp/GLM-4.5-Air-exl3 but not sure if full size is available or in the works.
thank you so much!
Also, looks like Thireus has a ton of options. I just grabbed his IQ4_KT special sauce quant, which weighs in at 168GB, and I'm running 90k context - pretty sweet! I'll do some ppl measurements on it as well just to see where it's at.
Oh nice that sounds like a good size! I'm uploading an IQ3_KT right now: 147.565 GiB (3.537 BPW), Final estimate: PPL = 3.4369 +/- 0.01975. I used iq4_kss on the ffn_down_exps instead of iq4_kt actually as some of my previous testing suggested they are similar (both exactly 4.0bpw) and the iq4_kss would have faster TG if anyone had to run it on CPU.
Would love to see any numbers you get! I have my perplexity workflow here: https://huggingface.co/ubergarm/GLM-4.5-GGUF/discussions/4#6896071d1bc2e44f792ce8f8 (mine tend to be just a tiny bit high running on CPU-only backend on this rig I've noticed comparing with some CUDA folks, not sure what that is about).
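Roughly, the measurement boils down to something like this (just a sketch with placeholder model path and test file; the exact flags and corpus I use are in the linked discussion):

```bash
# Sketch of a typical perplexity run (placeholder model path and test file;
# see the linked discussion for the exact workflow I use).
./build/bin/llama-perplexity \
    -m /models/GLM-4.5-GGUF/IQ3_KT/GLM-4.5-IQ3_KT-00001-of-00004.gguf \
    -f wiki.test.raw \
    -c 512 \
    -fa \
    --threads 16
```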
Have fun playing with all the quants!
For the IQ4_KT_Special quant from Thireus:
Final estimate: PPL = 3.3351 +/- 0.01906. Makes sense considering your IQ4_KSS.
How do you calculate the total bpw?
For the IQ4_KT_Special quant from Thireus:
Oh nice, yes just a bit higher than my IQ4_KSS. I too am curious what size that is exactly.
How do you calculate the total bpw?
I just look at the logs; when starting llama-server or running llama-perplexity it will show something like:
llm_load_print_meta: model type = 355B.A32B
llm_load_print_meta: model ftype = IQ3_KT - 3.125 bpw
llm_load_print_meta: model params = 358.338 B
llm_load_print_meta: model size = 147.565 GiB (3.537 BPW) # <--- I copy paste this line for total size/BPW
llm_load_print_meta: repeating layers = 146.560 GiB (3.529 BPW, 356.786 B parameters)
llm_load_print_meta: general.name = GLM 4.5
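If you want to double-check that BPW number by hand, it's just the total model size in bits divided by the parameter count, e.g. a quick back-of-the-envelope calc using the numbers above:

```bash
# BPW = model size in bits / parameter count
# 147.565 GiB -> bytes -> bits, divided by 358.338 B parameters
echo "scale=3; 147.565 * 1024^3 * 8 / 358338000000" | bc
# prints ~3.537, matching the BPW in the log line above
```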
Here's some interesting results:
$ docker run --rm --gpus all -v models:/models ik_llama:latest /usr/local/bin/llama-sweep-bench -m /models/GLM-4.5-GGUF/IQ3_KT/GLM-4.5-IQ3_KT-00001-of-00004.gguf -c 32764 -ngl 999 --no-mmap --threads 16 -b 4096 -ub 4096 -fa
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
4096 | 1024 | 0 | 3.896 | 1051.35 | 39.357 | 26.02 |
4096 | 1024 | 4096 | 4.326 | 946.89 | 43.548 | 23.51 |
4096 | 1024 | 8192 | 4.858 | 843.10 | 48.862 | 20.96 |
4096 | 1024 | 12288 | 5.381 | 761.18 | 53.799 | 19.03 |
4096 | 1024 | 16384 | 5.909 | 693.21 | 58.304 | 17.56 |
4096 | 1024 | 20480 | 6.460 | 634.02 | 63.597 | 16.10 |
4096 | 1024 | 24576 | 7.213 | 567.88 | 68.713 | 14.90 |
4096 | 1024 | 28672 | 8.192 | 499.99 | 73.088 | 14.01 |
$ docker run --rm --gpus all -v models:/models ik_llama:latest /usr/local/bin/llama-sweep-bench -m /models/GLM-4.5-GGUF/IQ4_KT-Special/GLM-4.5-THIREUS-IQ4_KT-SPECIAL_TENSOR-00001-of-01762.gguf -c 32764 -ngl 999 --no-mmap --threads 16 -b 4096 -ub 4096 -fa
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
4096 | 1024 | 0 | 3.918 | 1045.42 | 36.259 | 28.24 |
4096 | 1024 | 4096 | 4.400 | 931.00 | 40.682 | 25.17 |
4096 | 1024 | 8192 | 4.934 | 830.24 | 45.855 | 22.33 |
4096 | 1024 | 12288 | 5.451 | 751.40 | 50.933 | 20.10 |
4096 | 1024 | 16384 | 5.985 | 684.41 | 55.334 | 18.51 |
4096 | 1024 | 20480 | 6.552 | 625.12 | 60.364 | 16.96 |
4096 | 1024 | 24576 | 7.328 | 558.94 | 65.469 | 15.64 |
4096 | 1024 | 28672 | 8.411 | 486.97 | 69.932 | 14.64 |
Definitely interesting results. Not sure yet what to make of it.
Oh great thanks for making some llama-sweep-bench charts between mine and Thireus' quants. A couple thoughts/questions:
- Can you tell me what the size of that Thireus-IQ4_KT quant is e.g. this line in the startup logs:
llm_load_print_meta: model size = 147.565 GiB (3.537 BPW) # <--- I copy paste this line for total size/BPW
- You can add `--warmup-batch` on ik's fork (no need on my mainline branch of the llama-sweep-bench port though, as it is hardcoded enabled there). It shouldn't affect much, but without it the first point can be lower; then again you're fully offloading, so it probably doesn't matter much.
- Since you're fully offloaded, setting threads to exactly 1, e.g. `--threads 1` or `-t 1`, can sometimes give a few more percent boost. Combined, that would look something like the sketch below.
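Putting those together with your command from above, a sketch (same paths and flags you already used, just with the warmup and single-thread tweaks added):

```bash
# Same sweep as before, but with warmup enabled and a single thread since the
# model is fully offloaded to GPU (sketch; adjust paths to your setup).
docker run --rm --gpus all -v models:/models ik_llama:latest \
  /usr/local/bin/llama-sweep-bench \
    -m /models/GLM-4.5-GGUF/IQ3_KT/GLM-4.5-IQ3_KT-00001-of-00004.gguf \
    -c 32764 -ngl 999 --no-mmap \
    -b 4096 -ub 4096 -fa \
    --warmup-batch \
    --threads 1
```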
Yeah, each quantization type has a different kernel implementation depending on the backend (CUDA / Vulkan / CPU AVX2 / CPU AVX_VNNI / CPU NEON etc.), so different mixes can perform differently.
I used iq4_kss on the ffn_down_exps instead of iq4_kt actually as some of my previous testing suggested they are similar (both exactly 4.0bpw) and the iq4_kss would have faster TG if anyone had to run it on CPU.
Do you plan releasing IQ3_KL for us CPU-bound folks, or this IQ3_KT shouldn't be any slower than IQ3_KL?
Do you plan releasing IQ3_KL for us CPU-bound folks, or this IQ3_KT shouldn't be any slower than IQ3_KL?
The recent addition of iq2_kl has been useful, but to be honest I've never tried the `IQ3_KL : 4 bpw non-linear quantization mix`, whose listed description oddly suggests it is the same size as both the iq4_kt and also the new iq4_kss. So an IQ3_KL would technically be about the same size as the existing IQ4_KSS, probably.
The more TG CPU-friendly version would likely be using iq3_ks or iq3_k.
I'm guessing it would be slower only for TG, but given only ffn_(up|gate)_exps are trellis quants it would be interesting to see how well it keeps up with the iq3_ks etc.
What target RAM+VRAM are you looking for, and perhaps I'll release one more in this range using non-trellis quants. Also feel free to give it a try and report back how it performs on your system and include ram/vram/cpu/os info too. Thanks!
What target RAM+VRAM are you looking for
160 RAM + 12 VRAM. IQ3_K should be the best, leaving a bit for some limited context.
I don't think my tests would be representative, since I have such unusual RAM amount (2x48 + 2x32 DDR5 running at 62GB/s).
160 RAM + 12 VRAM. IQ3_K should be the best, leaving a bit for some limited context.
I don't think my tests would be representative, since I have such unusual RAM amount (2x48 + 2x32 DDR5 running at 62GB/s).
Oh, interesting combination, yes. I haven't measured how much the attn/shexp/first N ffn dense layers take up on VRAM here when offloading with the usual `-ngl 99 -ot exps=CPU`, e.g. how much room you have left over for kv-cache.
I did some llama-sweep-benches on an all CPU configuration and interestingly the IQ3_KT is not suffering too much on TG. Granted this is a huge AMD EPYC with a ton of cores, but throwing more cores at it actually slowed down TG (which is typical of the non-KT quants).
Also, I was able to get back some of the performance using the experimental ik_llama.cpp branch `ik/q8_k_r8_avx512`, which supports the Zen5 avx_vnni CPU flag. So definitely try that if you have a Zen5 chip on AM5 like the AMD 9950X (my personal home gaming rig uses this and sees a benefit mostly in PP uplift).
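Trying that branch is the usual checkout-and-rebuild, roughly like this (a sketch assuming the standard cmake workflow; add your normal backend options, e.g. CUDA, on top):

```bash
# Rough sketch: check out the experimental branch and rebuild.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
git checkout ik/q8_k_r8_avx512
cmake -B build
cmake --build build --config Release -j "$(nproc)"
```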
Finally, you could probably get a little more TG uplift experimenting with a draft model e.g. https://huggingface.co/jukofyork/GLM-4.5-DRAFT-0.6B-v3.0-GGUF running with something like:
-md DRAFT-0.6B-Q4_0.gguf \
-ngld 99 \
--draft 32 \
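For example, tacked onto a typical server invocation it might look something like this (a sketch only; the filenames and the offload split are placeholders based on the setup discussed above):

```bash
# Sketch: main model with routed experts on CPU/RAM, plus the small draft
# model fully on GPU for speculative decoding (placeholder paths).
./build/bin/llama-server \
    -m GLM-4.5-IQ3_KT-00001-of-00004.gguf \
    -ngl 99 -ot exps=CPU \
    -fa -fmoe \
    -md DRAFT-0.6B-Q4_0.gguf \
    -ngld 99 \
    --draft 32
```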
So for now I'll not upload a new model in that ~3.5bpw range and curious what you see if you give the existing IQ3_KT a try. Thanks!
hi @ubergarm
Can you share your full args to quantize this model? I want to make an IQ4_XS_R8 version with good quality. (Do I need to use an imatrix or calibration data?)
Based on https://github.com/ikawrakow/ik_llama.cpp/pull/624, are there any adjustments to the custom_q recipe for GLM-4.5 when running on CPU?
I tried this:
#!/usr/bin/env bash
custom="
# 93 Repeating Layers [0-92]
# Attention
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0
# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k
# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq6_k
blk\..*\.nextn\.shared_head_head\.weight=iq6_k
blk\..*\.nextn\.eh_proj\.weight=q8_0
# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 0 -m 0 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ5_K.gguf \
IQ5_K \
192 # threads
Can you share your full args to quantize this model?
I try to provide my full commands in all of the model cards for every quant, let me know if I missed something, but I believe it is all there. You can find additional information in my quant cookers guide here: https://github.com/ikawrakow/ik_llama.cpp/discussions/434
You are welcome to use my imatrix and any data that I have provided to make your own quant! Let me know how it goes and I'd love to see any llama-sweep-bench or llama-perplexity comparisons as well!
I want to make an IQ4_XS_R8 version with good quality.
I've never messed around with `_R8` quants and thought they existed mainly for internal use with activations etc. I no longer release `_R4` / `-rtr` quants either, as going with larger `-ub 4096 -b 4096` tends to favor non-repacked quants for PP speed now. I'd also advise against `iq4_xs`, as it is an older mainline quant; instead you can use the newer `iq4_ks` and `iq4_kss` SOTA quants with similar BPW but likely better perplexity.
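For example, starting from the recipe you posted, the routed-experts lines could become something like this (just a sketch of the idea, not a tuned recipe; measure perplexity yourself as always):

```bash
# Sketch: use the newer iq4_ks / iq4_kss on the routed experts instead of iq4_xs
# (untested mix, shown only to illustrate the --custom-q pattern).
custom="
# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
"
```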
Regarding PR624 which you link: it only affects `Q2_K, Q3_K, Q4_K, Q5_K, IQ2_KS, IQ3_KS, IQ3_K`, none of which appear in the recipe you listed. If you read all of PR624 you'll see that you might have to try with and without that branch compiled in and measure perplexity yourself to see which one is "better", as the tweaks don't seem to be uniformly better across all quants/models but may vary a bit, hence why it is unmerged so far.
What is your goal here? Are you trying to fit the best quality model into a specific RAM+VRAM target size? Anyway, have fun, quantizing is a cool hobby!
Good luck and cheers!
Thanks for the tips.
My target is to run the model on pure CPU with Zen4 AVX512 as fast as possible, with reasonable perplexity.
I read somewhere that non-linear quantization and `_R4`/`_R8` are good for CPU. I will try comparing the normal and `_Rx` versions with `-rtr`.
My target is to run the model on pure CPU with Zen4 AVX512 as fast as possible, with reasonable perplexity.
Zen4 doesn't get much speed-up with avx512 as it still takes multiple CPU cycles. Zen5 gives the real avx_vnni CPU flags which are faster: https://github.com/ikawrakow/ik_llama.cpp/pull/710
I read somewhere that non-linear quantization and `_R4`/`_R8` are good for CPU. I will try comparing the normal and `_Rx` versions with `-rtr`.
Yes, the repacked row-interleaved quants can be good for CPU/RAM inferencing, especially at lower batch sizes. Larger batch sizes, e.g. `-ub 4096 -b 4096`, can improve PP significantly even on MoE though; you will want to run llama-sweep-bench tests to compare results, as shown in the link above where I'm doing some CPU-only benchmarks.
`-rtr` is the same as `_R4` for all quants running on CPU/RAM, except for a case like IQ1_S which is not symmetric with IQ1_S_R4 I think, but the rest are; double check by looking at the closed PRs on ik_llama.cpp though. You can also "offline repack" using `llama-quantize` yourself to prepare an `_r4` version of any of my quants; then you don't need to use `-rtr`, so you can still use `mmap()` if needed for faster startup etc.
Thank you for the advice. Sorry I didn't answer earlier; I thought I'd only write after testing every possible optimization, but my progress kinda stalled, so I might as well just recollect what I tried.
I have Zen4 CPU (7700), so using AVX512 actually decreases performance due to heating.
The answer to
how much room you have left-over for kv-cache
is: just 2k tokens:
llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf -b 512 -ub 256 -c 2816 --no-mmap -t 8 -tb 7 -fa -ctk q8_0 -ctv q8_0 -fmoe -ot "exps=CPU" -ngl 99
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
256 | 64 | 0 | 18.887 | 13.55 | 22.644 | 2.83 |
256 | 64 | 256 | 18.531 | 13.81 | 23.138 | 2.77 |
256 | 64 | 512 | 19.379 | 13.21 | 22.950 | 2.79 |
256 | 64 | 768 | 19.192 | 13.34 | 23.809 | 2.69 |
256 | 64 | 1024 | 18.912 | 13.54 | 24.057 | 2.66 |
256 | 64 | 1280 | 18.736 | 13.66 | 26.112 | 2.45 |
256 | 64 | 1536 | 18.861 | 13.57 | 25.447 | 2.52 |
256 | 64 | 1792 | 19.062 | 13.43 | 23.655 | 2.71 |
256 | 64 | 2048 | 18.884 | 13.56 | 23.371 | 2.74 |
256 | 64 | 2304 | 19.376 | 13.21 | 23.514 | 2.72 |
256 | 64 | 2560 | 19.188 | 13.34 | 23.557 | 2.72 |
Obviously 2k isn't enough for anything, so I have to decrease ngl:
llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf -b 1024 -ub 512 -c 10240 --no-mmap -t 8 -tb 7 -fa -ctk q8_0 -ctv q8_0 -fmoe -ot "exps=CPU" -ngl 83
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 30.990 | 16.52 | 54.778 | 2.34 |
512 | 128 | 512 | 30.682 | 16.69 | 56.867 | 2.25 |
512 | 128 | 1024 | 31.399 | 16.31 | 56.414 | 2.27 |
512 | 128 | 1536 | 30.918 | 16.56 | 56.719 | 2.26 |
512 | 128 | 2048 | 30.664 | 16.70 | 57.066 | 2.24 |
512 | 128 | 2560 | 31.150 | 16.44 | 56.874 | 2.25 |
512 | 128 | 3072 | 31.179 | 16.42 | 56.605 | 2.26 |
512 | 128 | 3584 | 30.571 | 16.75 | 57.030 | 2.24 |
512 | 128 | 4096 | 30.791 | 16.63 | 57.521 | 2.23 |
512 | 128 | 4608 | 31.040 | 16.49 | 58.079 | 2.20 |
512 | 128 | 5120 | 31.328 | 16.34 | 57.964 | 2.21 |
512 | 128 | 5632 | 31.479 | 16.26 | 58.625 | 2.18 |
512 | 128 | 6144 | 31.463 | 16.27 | 58.985 | 2.17 |
512 | 128 | 6656 | 31.315 | 16.35 | 58.621 | 2.18 |
512 | 128 | 7168 | 31.317 | 16.35 | 59.899 | 2.14 |
512 | 128 | 7680 | 31.168 | 16.43 | 60.143 | 2.13 |
512 | 128 | 8192 | 32.308 | 15.85 | 60.282 | 2.12 |
512 | 128 | 8704 | 31.878 | 16.06 | 59.274 | 2.16 |
512 | 128 | 9216 | 30.871 | 16.59 | 60.104 | 2.13 |
10k context is already enough for many simple tasks, and TG is still not too bad for a GPU-poor setup. I'm less concerned about PP with small contexts, so I don't set `-ub 4096`, which would require reducing `ngl` further.
BTW, when I tried `ffn=CPU` with `-ngl 99` instead of `exps=CPU`, thus offloading only the attn tensors of all blocks, the results were worse:
llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf -b 2048 -ub 512 -c 10240 --no-mmap -t 8 -tb 7 -fa -ctk q8_0 -ctv q8_0 -fmoe -ot "ffn=CPU" -ngl 99
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 30.252 | 16.92 | 60.091 | 2.13 |
512 | 128 | 512 | 29.573 | 17.31 | 61.239 | 2.09 |
512 | 128 | 1024 | 29.529 | 17.34 | 60.954 | 2.10 |
512 | 128 | 1536 | 29.696 | 17.24 | 63.248 | 2.02 |
512 | 128 | 2048 | 29.843 | 17.16 | 62.106 | 2.06 |
512 | 128 | 2560 | 29.749 | 17.21 | 62.514 | 2.05 |
512 | 128 | 3072 | 30.754 | 16.65 | 60.991 | 2.10 |
512 | 128 | 3584 | 32.811 | 15.60 | 62.740 | 2.04 |
512 | 128 | 4096 | 30.536 | 16.77 | 62.033 | 2.06 |
512 | 128 | 4608 | 30.473 | 16.80 | 61.400 | 2.08 |
512 | 128 | 5120 | 30.109 | 17.00 | 62.899 | 2.04 |
512 | 128 | 5632 | 30.044 | 17.04 | 64.052 | 2.00 |
512 | 128 | 6144 | 29.956 | 17.09 | 64.646 | 1.98 |
512 | 128 | 6656 | 29.830 | 17.16 | 62.800 | 2.04 |
512 | 128 | 7168 | 30.906 | 16.57 | 63.378 | 2.02 |
512 | 128 | 7680 | 31.361 | 16.33 | 63.318 | 2.02 |
512 | 128 | 8192 | 31.476 | 16.27 | 62.979 | 2.03 |
512 | 128 | 8704 | 30.113 | 17.00 | 62.722 | 2.04 |
512 | 128 | 9216 | 29.939 | 17.10 | 65.556 | 1.95 |
512 | 128 | 9728 | 30.818 | 16.61 | 63.102 | 2.03 |
Now, for 32k context the batch size 4096 makes sense:
llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf -b 4096 -ub 4096 -c 32768 --no-mmap -t 8 -tb 7 -fa -ctk q8_0 -ctv q8_0 -fmoe -ot "exps=CPU" -ngl 39
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
4096 | 1024 | 0 | 28.031 | 146.12 | 514.676 | 1.99 |
4096 | 1024 | 4096 | 28.891 | 141.77 | 566.682 | 1.81 |
4096 | 1024 | 8192 | 30.607 | 133.83 | 613.948 | 1.67 |
4096 | 1024 | 12288 | 32.868 | 124.62 | 671.010 | 1.53 |
4096 | 1024 | 16384 | 37.314 | 109.77 | 762.135 | 1.34 |
4096 | 1024 | 20480 | 38.373 | 106.74 | 859.401 | 1.19 |
4096 | 1024 | 24576 | 42.212 | 97.03 | 935.687 | 1.09 |
I've tried GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf. It slows down TG by ~1.25x when used with llama-server. llama-sweep-bench just ignores the `-md` switch, so no benchmarks here. It's just slow.
I've also tried `--no-kv-offload` with `-ngl 99`; it's much slower, especially after 20k tokens.
Usually when TG is lower than 1 t/s I find it too slow and switch to a lower quant. Fortunately IQ3_KT only slows down to 1.0 after 32k tokens, and coincidentally that's where all models start overlooking older parts of context, so I tend to never use more than 32k anyway.
So, for now I'll be using IQ3_KT. Thank you for the good quant. The only thing that bothers me is 10GB of free RAM that could be used for decreasing perplexity (e.g. store ffn_(gate|up) of Routed Experts in IQ3_K_R4 or IQ3_KM, not sure which is better).
Oh, actually, why the IQ3_KT stores attn_k and attn_v of Routed Experts in Q8_0 while your IQ4_KSS stores them in IQ6_K?
Shouldn't the lower quant use same IQ6_K? (would really help store more of them in my superlimited VRAM)
I was thinking about cooking a balanced "IQ3_K" quant using Thireus' GGUF-Tool-Suite, but since I'm still on Windows, it doesn't want to cooperate without some wrestling.
I have Zen4 CPU (7700), so using AVX512 actually decreases performance due to heating.
Yeah, Zen4 AVX512 instructions take multiple CPU clocks to perform, so no big benefit for PP like on Zen5 unfortunately. Interesting it heats your CPU up.
Obviously 2k isn't enough for anything, so I have to decrease ngl:
So I wouldn't recommend reducing ngl; the strategy for MoE is `-ngl 99` and putting all the routed exps on CPU/RAM. But I understand you have only 12GB VRAM, which is quite low despite a lot of RAM. You could also play with reducing the kv-cache size on VRAM with heavier quantization, e.g. `-ctk q6_0 -ctv q6_0` or `-ctk q4_1 -ctv q4_1` or `-ctk iq4_nl -ctv iq4_nl`, something better than q4_0 but smaller than q8_0. Even so you might not get enough context fully offloading... hrmm...
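Something like this is roughly what I'd try first (a sketch using your model filename from above; with 12GB VRAM you may still come up short on context):

```bash
# Sketch: attn/shexp/dense layers on GPU, routed experts on CPU/RAM,
# with a smaller-than-q8_0 kv-cache quant; push -c up until VRAM runs out.
llama-server -m GLM-4.5-IQ3_KT.gguf \
    -ngl 99 -ot "exps=CPU" \
    -fa -fmoe \
    -ctk q6_0 -ctv q6_0 \
    -c 4096 \
    -t 8
```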
BTW, when I tried ffn=CPU with -ngl 99 instead of exps=CPU, thus offloading only attn of all blocks, the results were worse:
Of course: you were offloading the first N dense layers `ffn_(gate|down|up)` as well as the shared expert `ffn_(gate|down|up)_shexp`, which are always-active weights for every token, onto CPU. The strategy for MoE is to always keep those on GPU/VRAM and only offload the routed experts tensors `ffn_(gate|down|up)_exps` onto CPU/RAM. You can check the model card sidebar on huggingface to see the exact names of tensors for regex matching.
-t 8 -tb 7
You have 8 physical CPU cores, so I'd recommend just using `-t 8` and being done with it. Not sure why you are using fewer threads for threads-batch (prompt processing/prefill)? Typically tb should be higher, but only on big many-core CPUs. For your system 8 and 8 should be best.
I've also tried --no-kv-offload with -ngl 99, it's much slower, especially after 20k tokens.
Yes, I've heard some use this strategy only if they require a ton of kv-cache and are willing to go very, very slowly. Otherwise always keep the kv-cache on GPU/VRAM.
I've tried GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf. It slows down TG by ~1.25x when used with llama-server. llama-sweep-bench just ignores the -md switch so no benchmarks here. It's just slow.
Ahh yeah, I've heard from some that it doesn't have enough valid token matches to be worth it for many types of applications. Thanks for trying. Also, with only 12GB VRAM you barely have enough to run the main model, let alone add another small model into VRAM. Huh, I thought llama-sweep-bench would use it; I have a test with a different model showing about a 1 tok/sec speed-up with llama-sweep-bench. Oh well, probably not worth more exploration on your setup.
The only thing that bothers me is 10GB of free RAM that could be used for decreasing perplexity (e.g. store ffn_(gate|up) of Routed Experts in IQ3_K_R4 or IQ3_KM, not sure which is better).
Be careful trying to mini-max so much and splitting up ffn tensors across different devices. `-fmoe` needs to have (gate|up) on the same device at a minimum to work, for example. Also, depending on PCIe, it could add extra communication overhead, possibly going back and forth for each layer. 10GB of RAM probably isn't going to make a noticeable difference.
Oh, actually, why the IQ3_KT stores attn_k and attn_v of Routed Experts in Q8_0 while your IQ4_KSS stores them in IQ6_K?
Shouldn't the lower quant use same IQ6_K? (would really help store more of them in my superlimited VRAM)
The attn tensors don't belong to the routed experts; each layer can have a mix of many tensor types. I chose the larger q8_0 attn tensors for the KT because sometimes that slightly larger size over iq6_k can give a noticeable boost in perplexity. I generally design assuming a 16GB VRAM minimum, where it wouldn't matter so much; but on your system the small savings of iq6_k over q8_0 would allow you some more context etc. Sorry about that. There are no hard and fast rules about "shouldn't the lower quant..." really. There is a tradition of quantization mix schemes, e.g. `IQ4_K_M` or `IQ4_K_XL`, which have some meaning hard-coded into llama-quantize. The unsloth quants are basically just this with a slightly different mix. I only use custom quantizations and have never limited myself to the traditional mixes, as bartowski, mradermacher, and unsloth do a fine job already with those flavors.
In general keeping attn tensors a bit higher compared to the rest of the mix can give pretty good perplexity boost for the size.
How fast is your NVMe SSD? If it is PCIe Gen 5, e.g. a Crucial T700 drive, you might be better off going with the IQ4_KSS with its smaller attn tensors, letting the model hang out of RAM onto SSD and letting the default read-only mmap() operate off of the page cache. I've run DeepSeek 671B like this at up to 4-5 tok/sec with only 96GB RAM.
The main thing I'd recommend you explore is `-rtr` for run-time repack, which disables mmap() and mallocs the entire model on start, with the tensors running on CPU/RAM repacked into row-interleaved format. This can give a boost to TG as it improves cpu/ram/cache effectiveness. You can also play with `-ub 1024 -b 2048` or other batch sizes smaller than 4096 etc.; the defaults are `-ub 512 -b 2048` fwiw.
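e.g. something along these lines, benchmarked against your current settings (a sketch built from your earlier 10k-context command):

```bash
# Sketch based on your earlier command: same offload split, but with run-time
# repack of the CPU-resident tensors and a larger batch. Compare against your
# current numbers with llama-sweep-bench before settling on it.
llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf \
    -rtr \
    -b 2048 -ub 1024 \
    -c 10240 -t 8 -fa \
    -ctk q8_0 -ctv q8_0 -fmoe \
    -ot "exps=CPU" -ngl 83
```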
Okay keep hacking and maybe you can squeeze another tok/sec out of your system!