Thanks for your work! Any chance for something between IQ2_K_R4 and IQ3_K_R4?
As always, thanks for your work.
As you know, I had a 5090 + 4090x2 + A6000 + 3090.
Luckily, I managed to get another 5090 quite cheap (1900 USD), but at the same time my A6000 died (RIP), so I have 16GB less VRAM than before. So now I have 5090x2 + 4090x2 + 3090.
I can fit IQ2_K_R4 with ease, but IQ3_K_R4 is out of reach. The former seems to be ~9.2% worse than Q8_0, at 2.8 bpw. Is there a chance for something a bit higher, like 3.0 or 3.1 bpw? That might be a significant gain in quality. Comparing with DeepSeek V3 0324, my max seems to be about ~3.4 bpw at 685B (rough math below).
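(Back-of-the-envelope for that ~3.4 bpw figure, a sketch only: usable memory times 8 bits divided by parameter count. The memory figures come from later in this thread; the overhead allowance is my own rough assumption, not from the thread.)

awk 'BEGIN {
  vram = 136; ram = 180   # GB: VRAM and usable RAM mentioned later in the thread
  overhead = 25           # GB: rough allowance for KV cache and buffers (an assumption)
  params = 685e9
  printf "max ~%.1f bpw\n", (vram + ram - overhead) * 1e9 * 8 / params
}'
# max ~3.4 bpw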
If not, there is no problem and I can close the issue.
Thanks!
Heya!! You have quite a collection of GPUs, hard to keep track lol.
So you have 136GB VRAM currently plus how much RAM (I forget)?
The overall quant size depends mostly on the types chosen for ffn_down and ffn_(gate|up). I usually use the smaller type for gate|up and one size bigger for down, following the precedent from my own research.
I'm not sure what quant types to use for something in between to hit that 3.0~3.1 overall BPW, which would be roughly a 254GiB file. There are some very new QTIP / exl3-style trellis quants available, e.g. iq2_kt. Maybe using full iq5_ks for attn/shexp/token_embed, iq4_kt for ffn_down, and iq3_kt for ffn_(gate|up) would land in that in-between zone that is missing a quant (a rough sketch of such a recipe is below)...
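Roughly, such a recipe sketch might look like the following (illustrative only: the paths, the imatrix file, the thread count, and the exact --custom-q regex syntax are placeholders/assumptions for whatever llama-quantize build you use, not a final recipe):

# Hypothetical mixed recipe: iq5_ks for attn/shexp/token_embed,
# iq4_kt for ffn_down experts, iq3_kt for ffn_(gate|up) experts.
custom="
token_embd\.weight=iq5_ks
blk\..*\.attn_.*=iq5_ks
blk\..*\.ffn_.*_shexp\.weight=iq5_ks
blk\..*\.ffn_down_exps\.weight=iq4_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_kt
"
# collapse the newline-separated rules into the comma-separated form --custom-q expects
custom=$(echo "$custom" | sed -Ez 's:\n+:,:g;s:,$::;s:^,::')

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /path/to/imatrix-DeepSeek-R1-0528.dat \
    /path/to/DeepSeek-R1-0528-BF16.gguf \
    /path/to/DeepSeek-R1-0528-CUSTOM.gguf \
    IQ3_KT 24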
However, the inference implementation isn't fully baked yet, which is why I haven't released my test quant that took like 8 hours to cook haha...
Anyway, I'll keep it in mind to try an iq3_kt mix, which might land roughly in that size... But don't expect anything immediately from me. I'll leave this open in case other folks have thoughts or are also interested.
Cheers!
Hello! Yeah haha, sadly the A6000 died, else I would have a ton more VRAM.
I have 136GB VRAM and 192GB RAM, but I can't effectively load IQ3_K_R4. On Linux/Fedora my usable RAM seems to max out at about 180GB, which combined is still more than 300GB, but since I can't split the tensors by size, only by layers, some GPUs end up with 2-3GB left over (and I can't add more, as each layer is about 4GB with up, down and gate; I guess I could load part of a layer on one GPU and the rest on another, see the sketch below).
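(For example, the split-layer idea might look roughly like this, borrowing the -ot syntax from the commands later in this thread; the model path, layer number, and device indices are placeholders:)

# keep layer 30's gate experts on CUDA3 but push its up/down experts to CUDA4,
# with every other layer's experts on CPU (placeholder model path / devices)
./llama-server -m '/GGUFs/model.gguf' -c 32768 --no-mmap -ngl 999 \
-ot "blk.30.ffn_gate_exps.weight=CUDA3" \
-ot "blk.30.ffn_(up|down)_exps.weight=CUDA4" \
-ot "ffn.*=CPU" \
-fa -mg 0 -mla 1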
I can load DeepSeek V3 0324 Q3_K_XL, whose real size is 276GB, using all the GPUs and ~170-175GB RAM.
And yeah, I can imagine it takes a lot of time! If you can, it's welcome, but no pressure or worries.
Just an update: I semi-revived my A6000 temporarily by resoldering the power connector and using an EPS 8-pin cable instead of the adapter. So now I have 208GB VRAM + 192GB RAM, which lets me load IQ3_K_R4 without issues (and with a huge batch size).
But I will still leave this open, as something between IQ2_K_R4 and IQ3_K_R4 would be very helpful (and also my A6000 could die again at any moment lol).
Also, both of these work perfectly with multi-GPU, getting the expected TG speed (about 9-10 t/s) when more than half is offloaded to system RAM. PP t/s is quite variable but also aligns with the expected performance.
Haha, wow, you are brave bringing a soldering iron to your GPU, but if it was dead otherwise, that makes sense.
Yeah, another user on reddit, u/ciprianveg, was just asking me for the same size, so there is definitely some demand for that ~3.3ish BPW / ~256GiB size which has a gap right now.
I'm currently experimenting with ik's latest iq4_kt, which is 4.0 BPW and kind of a better iq4_kss, but it is still quite new. I'll noodle on it and definitely let you know.
So yeah keep this open and we'll see! Thanks!
Thanks for remembering :) Yes, I am also looking for something between IQ2_K_R4 and IQ3_K_R4, circa 260 +/- 10GB.
@panchovix
@ciprianv
@voidstare
No promises, but check this out:
ubergarm/DeepSeek-R1-0528-IQ3_KT
272.527 GiB (3.483 BPW)
I just tested ik's latest PR with the interesting-looking experimental iq3_kt quants that were just added. We'll have to wait for these new quants to be finalized, and I still want some more data for speed comparisons with similarly sized quants. Quality looks pretty good though, in terms of both perplexity and KLD!
It is maybe a touch too big though, not sure what y'all think?
I will download it today and try it tomorrow. Size is perfect for me. Thank you!
Later edit: I see this quant is not available yet: ubergarm/DeepSeek-R1-0528-IQ3_KT
Sorry I was not clear - I have not yet released this model. The PRs are still open, and I want to wait for the iqN_kt quant implementation to settle down before releasing anything that could change/break soon.
In general it sounds like 273GiB is not too big though for the ~3.5BPW model.
And yes, I've had some requests for a ~192GiB model too for the 4x48GB DIMM crowd, so I'm keeping that in the back of my mind for once the dust settles haha...
Thanks!
The size is perfect for me, so I will wait for the PR to be merged and, if all is OK, for you to publish the new 273GB model. 😀👍
I still haven't uploaded the experimental iq3_kt quant, as I'm not yet sure if the implementation will change again soon. I just did some interesting benchmarks and discussions with ik here which suggest that while PP is excellent (among the best of the various quants), TG is now CPU-limited due to the additional overhead of int32 calculations in the Trellis sequence generation during unpacking. It seems to hit Intel Xeon systems harder than AMD Zen4/5 given the specifics of the avx2 CPU instructions.
So the tl;dr is: if things seem stable in the next few days, I'll go ahead and upload it, knowing it is experimental and might change. PP performance should be top notch, but TG will likely be slower than equivalently sized quants for the tensors/layers running on CPU/RAM.
Quick update: token generation speed is improving for CPU-offloaded layers on the new KT Trellis quants.
Note that this llama-sweep-bench run uses default batch sizes, offloading about as many extra layers onto the 2x GPUs as possible. If you use larger batch sizes, e.g. -ub 2048 -b 2048, the IQ3_KT will likely begin to out-perform the _R4 quant in prompt processing speed (no effect on TG; a sketch of such a run follows).
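(A sketch of what re-running the sweep with bigger batches could look like, reusing flags that appear elsewhere in this thread; the model path and offload layout are placeholders, adjust -ot overrides for your GPUs:)

# same model, larger ubatch/batch to favor the IQ3_KT's prompt processing
./build/bin/llama-sweep-bench \
--model /path/to/DeepSeek-R1-0528-IQ3_KT-00001-of-00006.gguf \
-fa -mla 3 -fmoe \
--n-gpu-layers 63 \
-ot exps=CPU \
-ub 2048 -b 2048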
The PR541 used for testing is not merged at the time of writing, but may be very soon. I think it is good enough to publish at this point, at least for testing, knowing there may still be breaking changes if ik wants to do something different. So if he merges this PR to main, I'll go ahead and upload it with the warning that there are no guarantees the KT quants won't have future breaking changes.
At least you'll have something new and fun to test hahah...
The new IQ3_KT quality looks pretty good and the KLD max delta P was very low, so it seems like a good intermediate size between the nearest other two (a rough BPW-to-size cross-check follows the list):
- IQ3_K_R4
- 300.938 GiB (3.847 BPW)
- Perplexity 3.2730 +/- 0.01738
- IQ3_KT
- 272.527 GiB (3.483 BPW)
- Perplexity 3.3056 +/- 0.01758
- IQ2_K_R4
- 219.019 GiB (2.799 BPW)
- Perplexity 3.5069 +/- 0.01893
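(As a rough cross-check of how BPW maps to file size: size ≈ n_weights × BPW / 8 bytes, assuming ~672e9 weights as counted in the GGUF, which is my estimate rather than an official figure:)

awk 'BEGIN {
  n = 672e9                                  # approx. weight count (assumption)
  split("2.799 3.483 3.847", bpw, " ")
  for (i = 1; i <= 3; i++)
    printf "%.3f BPW -> %.1f GiB\n", bpw[i], n * bpw[i] / 8 / 2^30
}'
# 2.799 BPW -> 219.0 GiB
# 3.483 BPW -> 272.5 GiB
# 3.847 BPW -> 301.0 GiB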
🤞Fingers crossed! 🤞
Amazing, many thanks! Downloading it right now to see how it goes!
Testing it now and it feels noticeably better than IQ2_K_R4!
It's just a bit slower on the TG side, as you mentioned before.
I get
INFO [ print_timings] prompt eval time = 8978.25 ms / 1407 tokens ( 6.38 ms per token, 156.71 tokens per second) | tid="139804485447680" timestamp=1750548291 id_slot=0 id_task=0 t_prompt_processing=8978.255 n_prompt_tokens_processed=1407 t_token=6.381133617626154 n_tokens_second=156.71196685770232
INFO [ print_timings] generation eval time = 210023.44 ms / 998 runs ( 210.44 ms per token, 4.75 tokens per second) | tid="139804485447680" timestamp=1750548291 id_slot=0 id_task=0 t_token_generation=210023.439 n_decoded=998 t_token=210.44432765531064 n_tokens_second=4.751850577972871
That's with -ub 2048, but I can probably increase it to about 3096 without much issue.
I'm running it with
./llama-server -m '/GGUFs/DeepSeek-R1-0528-IQ3_KT-00001-of-00006.gguf' -c 65536 --no-mmap -ngl 999 \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26).ffn.=CUDA5" \
-ot "blk.(27|28|29|30|31|32|33|34).ffn.=CUDA6" \
-ot "blk.35.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA4" \
-ot "blk.35.ffn_gate_exps.weight=CUDA4" \
-ot "blk.36.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA5" \
-ot "blk.36.ffn_gate_exps.weight=CUDA5" \
-ot "ffn.*=CPU" \
-fa -mg 0 -ub 2048 -mla 1
Devices are
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 5: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 6: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
So many thanks for the model! If only I had 100GB more VRAM so I didn't need to offload haha.
Hi, thank you for your work. Unfortunately, on my 3955 Threadripper the KT quant looks slow on TG speed even though PP speed is good: less than half the TG speed of the IQ2_K_R4 model (3.4 vs 8 t/s).
main: n_kv_max = 73984, n_batch = 4352, n_ubatch = 4352, flash_attn = 1, n_gpu_layers = 63, n_threads = 16, n_threads_batch = 16
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|------|--------|----------|---------|------|
| 4352 | 1088 | 0    | 21.359 | 203.76   | 319.052 | 3.41 |
| 4352 | 1088 | 4352 | 22.755 | 191.25   | 327.814 | 3.32 |
CUDA_VISIBLE_DEVICES="0,1,2" ./build/bin/llama-sweep-bench \
--model /media/ciprian/ssd/models/DeepSeek-R1-0528-iQ3-KT/DeepSeek-R1-0528-IQ3_KT-00001-of-00006.gguf \
--alias DeepSeek-R1-0528-IQ3_KT \
--ctx-size 73984 \
-ctk q8_0 \
-mla 3 -fa \
-amb 512 \
-fmoe \
--temp 0 \
--min-p 0.01 \
--n-gpu-layers 63 \
-ot "blk.[2-4].ffn_up_exps=CUDA0,blk.[2-4].ffn_gate_exps=CUDA0,blk.[4].ffn_down_exps=CUDA0" \
-ot "blk.1[0-2].ffn_up_exps=CUDA1,blk.1[0-2].ffn_gate_exps=CUDA1,blk.1[0].ffn_down_exps=CUDA1" \
-ot "blk.1[3-4].ffn_up_exps=CUDA2,blk.1[3-4].ffn_gate_exps=CUDA2" \
--override-tensor exps=CPU \
--parallel 1 \
--threads 16 \
--threads-batch 16 \
--host 0.0.0.0 --port 5002 \
--ubatch-size 4352 --batch-size 4352 --no-mmap