Wahoo thanks for sharing your work!
Hey good to see you again! Great job cooking and releasing more ik quants! I appreciate the example commands and graphs with some benchmarks, very nice and thoughtful! This looks like a nice quant and folks have been asking me for a bigger one for the 512-768GB class rigs.
If I had two things to suggest (not criticism, just my wishlist haha):
- I've been encouraging folks to add ik_llama.cpp to the model card tags at the top to help people find ik's quants, e.g.
👈 huggingface readme modelcard tags
---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-R1-0528
license: mit
base_model_relation: quantized
tags:
- mla
- imatrix
- conversational
- ik_llama.cpp
---
- I know it takes forever, but I'd be curious to compare the final PPL rather than a graph of the first few blocks, e.g. Final estimate: PPL = 3.2688 +/- 0.01739. I'd love to compare some of my quants to this one, and having the same methodology would make that easier. I hope to release some comparisons and you could use them to compare your own as well then! No pressure, I know it takes a long time to let it finish haha...
👈 specific perplexity methodology
# i grabbed wiki.test.raw from ik's github link, he also has a huggingface repo with test files too
# here is the exact file i use
$ wget https://github.com/user-attachments/files/19090237/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ du -h wiki.test.raw
1.3M wiki.test.raw
$ sha256sum wiki.test.raw
173c87a53759e0201f33e0ccf978e510c2042d7f2cb78229d9a50d79b9e7dd08 wiki.test.raw
# with your specific numa/threads/offloading etc, just let it run to finish
$ numactl -N 0,1,2 --interleave=0,1,2 \
./build/bin/llama-perplexity \
--model "$model" \
-mla 3 -fa \
-amb 512 \
-rtr \
-fmoe \
-f wiki.test.raw \
--seed 1337 \
--threads 128 \
--numa numactl \
2>&1 | tee -a $logfile
...
Final estimate: PPL = 3.2688 +/- 0.01739
Thanks, I was looking for the huggingface docs on how to add all the extra metadata to the model, and the DeepSeek-R1 info was a little outdated/misleading.
I'll dig through the logs and publish the exact perplexity numbers.
Yeah I've noticed that the huggingface model card sidebar doesn't work for some ik quants e.g. this one at least exists but has blank entries for the ik quants: https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF?show_file_info=IQ2_K_R4%2FDeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf
You can get it yourself with ./gguf-py/scripts/gguf_dump.py, but it depends on whether all the quant type codes have been implemented in ik's Python, as that typically lags behind the cpp. I believe some folks did update it fairly recently. Let me know if you get that working to view GGUF metadata.
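If you'd rather script it than patch gguf-dump by hand, here is a minimal sketch using the GGUFReader from gguf-py (pip install gguf, or the copy bundled with a llama.cpp checkout). The file path is just an example, and ik-only quant types like the _R4/_R8 variants may still throw until the Python enums catch up:

```python
# Minimal sketch: dump GGUF key/value metadata and tensor info with gguf-py.
# Path is an example; ik-only quant types may error until the Python enums know them.
from gguf import GGUFReader

reader = GGUFReader("DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf")

# Key/value metadata (architecture, context length, quantized_by, etc.)
for name in reader.fields:
    print(name)

# Tensor names, shapes, and quantization types
for tensor in reader.tensors:
    print(tensor.name, list(tensor.shape), tensor.tensor_type.name)
```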
EDIT: ooh u mean the readme.md modelcard tags, yes, I didn't read the docs and just hacked on it a bit until it worked okay lol...
I still need to figure out how to edit the metadata to add in some extra fields like who made it, the URL etc, but that is more "nice to have" and I've skipped it every time just to rush the quant out the door haha...
Yeah, the ik_llama quants like _R4 and _R8 are not supported by vanilla gguf-dump. I had to manually patch the Python code in the gguf-dump venv to get it working with these quants. Maybe worth posting the gguf-dump output as well, since it's not easily accessible.
Thank you for sharing this quant.
It looks exactly like something I would want to use. Do you have any idea on how to benchmark this quant vs ubergarm's quant of equivalent size for coding tasks?
Would an aiderbench be useful?
I'd be very interested even in something informal otherwise.
Thx!
Glad you found it useful! I made this quant to use myself; if ubergarm had a quant of similar size, I'd be using that instead of crowding the space with more of the same.
Something like aiderbench could be useful, but I don't put much trust in the formal benchmarks, because models tend to get over-optimized to perform well on those, and sometimes benchmark gains don't transfer to real-world applications. The other measure not captured well by the formal benchmarks is how badly models blunder when they fail: are they a little bit off, or do they produce complete gibberish? If it's measured as the best attempt out of 3, then it doesn't matter for the score. However, in practical applications with multiple steps, these extreme failures can get the model stuck permanently, where it can't dig itself out of the pit.
In terms of informal benchmarking, I have an agent that attempts to implement a 3D game over several iterations. I run it many times and evaluate how far it gets before going off the rails. Smaller quants consistently go off the rails much sooner than larger quants. A simpler version of this is to ask the model to implement the spinning hexagon benchmark and see how many attempts it takes to produce working code that meets all the requirements without additional fixes. In the extreme case of Q1 quants, they never produce anything that runs; Q2 quants produce code that runs but most of the time doesn't meet the requirements. Larger quants can usually solve this, but then it becomes a question of how many times they need to try. You can also evaluate how badly they blunder: is it a catastrophic failure that would break the chain of dependent tasks, or a minor import issue that is easily fixed?
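To make that "attempts until it runs" loop concrete, here is a rough sketch of how it could be scripted against an OpenAI-compatible endpoint like llama-server. This is not my actual harness, just the idea; the URL, prompt, and crude pass/fail check are all placeholders, and it needs the requests package:

```python
# Rough sketch: count attempts until the model produces code that at least runs.
# URL, prompt, and pass/fail heuristic are placeholders, not a real benchmark.
import re
import subprocess
import requests

URL = "http://localhost:8080/v1/chat/completions"  # hypothetical local llama-server
PROMPT = "Write a single-file Python/pygame program: a ball bouncing inside a spinning hexagon."

def generate(prompt: str) -> str:
    # Any OpenAI-compatible server accepts this request shape.
    resp = requests.post(URL, json={
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.5,
    }, timeout=3600)
    return resp.json()["choices"][0]["message"]["content"]

def extract_code(reply: str) -> str:
    # Take the last fenced code block; fall back to the raw reply.
    blocks = re.findall(r"```(?:python)?\s*(.*?)```", reply, re.DOTALL)
    return blocks[-1] if blocks else reply

attempts = 0
while True:
    attempts += 1
    code = extract_code(generate(PROMPT))
    try:
        # Crude pass/fail: does it survive 10 seconds without crashing?
        proc = subprocess.run(["python3", "-c", code], capture_output=True, timeout=10)
        ok = proc.returncode == 0
    except subprocess.TimeoutExpired:
        ok = True  # still running after 10s, so it at least runs
    if ok:
        print(f"produced running code after {attempts} attempt(s)")
        break
```

Checking whether the result actually meets all the requirements still needs a human look, which is why I treat this as informal.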
Another interesting observation I had for DeepSeek quants is that the chain of thought tends to be in shorter paragraphs for smaller quants. When you go from Q2 to Q4 you can see the length difference in each "thought" as it scrolls by on your terminal, because newlines usually separate distinct "thoughts". Larger quants will have longer, coherent "thoughts" that probe deeper into the problem domain, while the smaller quants will have shallow, surface-level "thoughts" that are less useful.
Hope this helps!
I'd be curious to see some kind of aiderbench too, but for simplicity I use the built-in llama-perplexity to measure perplexity on wiki.test.raw as well as KLD on a personal unpublished novel text test corpus.
Interestingly, @anikifoss chose not to use imatrix, and I'm very curious whether it affects the numbers much or not. I've since released a larger model using IQ5_KS and IQ4_KS, which increase speed a bit at the cost of quality. I believe this repo's DQ4_K_R4 should be about 413.2 GiB adding up the GBs for each file, which is still the most chonky ik quant published in terms of pure BPW, pretty sure, given it is using IQ6_K for ffn_down and IQ4_K for (gate|up). My biggest one, the IQ4_KS_R4, is 4.701 BPW (368 GiB) now.
Huh, thanks for sharing the gguf-dump and perplexity values of your quant as well. Interestingly, despite using the same wiki.test.raw, the values do not look comparable to mine seen in the above chart.
If I normalize each set to its own baseline with something like np.log(quant/base), then perhaps I can compare them.
# np.log is natural log ln()
>>> import numpy as np
# values from anikifoss
>>> base=3.5184
>>> qs = [3.5184, 3.5308, 3.5415, 3.8099, 3.9535]
>>> [np.log(q/base)*100 for q in qs]
# base, DQ4_K_R4, Q4_K_R4, DQ2_K_R4, Q2_K_R4
[0.0, 0.35181333455809527, 0.6544025393180614, 7.959660125663863, 11.659492171048505]
# values from ubergarm quants
>>> base=3.2199
>>> qs = [3.2199, 3.2286, 3.2730, 3.5069, 4.8831]
>>> [np.log(q/base)*100 for q in qs]
# base, IQ4_KS_R4, IQ3_K_R4, IQ2_K_R4, IQ1_S_R4
[0.0, 0.26983035678405687, 1.6356692345592185, 8.538215317827454, 41.64299609099737]
# lower is better. keep in mind this was wiki.test.raw english and not coding stuff or other languages etc.
Sorry, it's hard to make sense of without a graph haha... My guess is my numbers are lower for similar-sized quants because I used imatrix, but in terms of "does it vibe code better" I honestly couldn't tell you haha... Or it could just be that anikifoss's numbers were scaled larger for some reason and so they aren't actually comparable.
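If it helps, here's a quick matplotlib sketch that plots the same normalized values from the snippet above (purely illustrative, plotted by index since the two sets use different quant mixes):

```python
# Quick visual of the normalized PPL values computed above: 100 * ln(PPL / base).
import numpy as np
import matplotlib.pyplot as plt

aniki = {"base": 3.5184, "DQ4_K_R4": 3.5308, "Q4_K_R4": 3.5415,
         "DQ2_K_R4": 3.8099, "Q2_K_R4": 3.9535}
uber = {"base": 3.2199, "IQ4_KS_R4": 3.2286, "IQ3_K_R4": 3.2730,
        "IQ2_K_R4": 3.5069, "IQ1_S_R4": 4.8831}

fig, ax = plt.subplots()
for label, data, marker in [("anikifoss", aniki, "o"), ("ubergarm", uber, "s")]:
    names = list(data)
    vals = [100 * np.log(v / data["base"]) for v in data.values()]
    ax.plot(range(len(vals)), vals, marker=marker, label=label)
    for x, (name, y) in enumerate(zip(names, vals)):
        ax.annotate(name, (x, y), fontsize=8, rotation=30)

ax.set_ylabel("100 * ln(PPL / base PPL)   (lower is better)")
ax.set_xlabel("quant (baseline on the left)")
ax.legend()
plt.tight_layout()
plt.show()
```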
Anyway, exciting to have more high quality ik quants to choose from! Thanks again for publishing and sharing all the details!
The Q8_0 baseline should not be affected by quantization issues or imatrix. I couldn't find the exact source for my perplexity sample; I assumed it was the same as posted in the ik_llama discussion, but maybe not.
I assumed it was the same as posted in ik_llama discussion
Yeah, I've posted my methodology but it is buried in various folds and I can never find it myself lol (and it's slightly changed). I thought you used the same though, but I don't see the reference on the model card now; I thought I had seen it... anyway, here is what I'm doing currently:
$ wget https://github.com/user-attachments/files/19090237/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ du -h wiki.test.raw
1.3M wiki.test.raw
$ sha256sum wiki.test.raw
173c87a53759e0201f33e0ccf978e510c2042d7f2cb78229d9a50d79b9e7dd08 wiki.test.raw
$ ./build/bin/llama-perplexity \
--model /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ3_K_R4.gguf \
-f wiki.test.raw \
--seed 1337 \
--ctx-size 512 \
-mla 3 -fa \
-amb 512 \
-fmoe \
--n-gpu-layers 99 \
-ot "blk\.(3|4|5|6|7|8)\.ffn_.*=CUDA0" \
-ot "blk\.(9|10|11|12|13)\.ffn_.*=CUDA1" \
--override-tensor exps=CPU \
--threads 24
Final estimate: PPL = 3.2730 +/- 0.01738
Yeah, I just removed the reference from the model card to avoid confusion (I copy pasted the link from ik_llama assuming it was the same)
Oooh, I see, you just deleted the reference because the file you used was different from the one I used, if I understand you correctly. That would explain why we have different values then. Thanks for clarifying!
Isn't the problem caused by the same data being used for both the quantization optimization (imatrix) and the test (perplexity)?
How hard would it be to compute perplexity scores on (a subset of) The Stack or any other code dataset?
I think it would give us an idea of the impact of the imatrix on code generation, if any.
What do you think?
What makes LLMs valuable for coding is not just the knowledge of a particular programming language, but also knowledge about all the problem domains. In other words, when you're trying to automate something, it helps to have an encyclopedic knowledge about the task you're automating. For small quants, imatrix is good at improving specific benchmarks at the expense of general knowledge. If you have a narrow problem domain (like game development) and a fixed language, like Python, then imatrix could be useful to produce a model that is good at game development in Python while being very compact. The approach I've taken is the opposite: a generalist model that is large and in charge.
Isn't the problem caused by the same data being used for both the quantization optimization (imatrix) and the test (perplexity)?
Heya! I purposely do not use wiki.test.raw or wiki test in general in my imatrix corpus to avoid potentially overfitting it as it is a common benchmark corpus (that I also use). For this same reason I do my KLD calculations on a private corpus of "novel" text that likely has not been used for training or imatrix fitting etc. Mine includes a variety of text, code, maths, and various languages to hopefully not over-fit any single domain, but who knows really!
How hard would it be to compute perplexity scores on (a subset of) The Stack or any other code dataset?
You can run any UTF-8 text file as the corpus if you prepare it. Just replace wiki.test.raw in my commands with your file, yes.
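For a code-flavored corpus, a minimal sketch like this concatenates source files into one UTF-8 file you can pass with -f (the directory, extensions, and size cap are just example choices):

```python
# Minimal sketch: build a UTF-8 corpus file for llama-perplexity from source code.
# Directory, extensions, and size cap below are arbitrary examples.
from pathlib import Path

SRC_DIR = Path("./my_code_repo")      # hypothetical directory of code to sample
EXTS = {".py", ".c", ".cpp", ".rs"}
MAX_BYTES = 2 * 1024 * 1024           # roughly the size of wiki.test.raw

written = 0
with open("code.test.raw", "w", encoding="utf-8") as out:
    for path in sorted(SRC_DIR.rglob("*")):
        if path.suffix not in EXTS or not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        out.write(text + "\n\n")
        written += len(text)
        if written >= MAX_BYTES:
            break
print(f"wrote ~{written} chars to code.test.raw")
```

Then point llama-perplexity at code.test.raw with -f instead of wiki.test.raw.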
I think it would give us an idea of the impact of the imatrix on code generation, if any.
What do you think?
And as @anikifoss says:
The approach I've taken is the opposite: a generalist model that is large and in charge.
Yeah, at some point part of this is a bit of a dark art, lmao... Check out this old llama.cpp discussion on the very topic going on over a year ago haha... Unsloth is now using some 12k-context, synthetic, per-model-architecture imatrix corpus, supposedly using specific tokenizations and such. Bartowski has been mixing his up, and especially had to add more tokens for Qwen3-30B-A3B, as the experts can be quite sparse and need a variety of data to even activate.
With aniki's quant it is so big that imatrix probably wouldn't help much, as imatrix is supposedly only really helpful under 4 BPW or so.
We can take measurements of PPL and KLD, but they are quite sensitive to the exact parameters and corpus used, which can make it hard to compare "apples to apples". Still, I find it useful for comparing a collection of quants of the same model, all made with the same imatrix in the same way, at least. I do sometimes compare across quants of the same model beyond that, but it becomes more difficult to say much.
Anyway, it's fun stuff for sure! I learned a lot today, and thanks for the tip on that pesky attn_k_b tensor earlier today too, aniki!
@ubergarm here are the perplexity numbers:
>>>>>> Q2_K_R4
Final estimate: PPL = 3.7371 +/- 0.02053
>>>>>> DQ2_K_R4
Final estimate: PPL = 3.5520 +/- 0.01928
>>>>>> Q4_K_R4
Final estimate: PPL = 3.2368 +/- 0.01714
>>>>>> DQ4_K_R4
Final estimate: PPL = 3.2276 +/- 0.01708
>>>>>> Q8_0
Final estimate: PPL = 3.2121 +/- 0.01698
And the full command line (click to expand)
echo ">>>>>> DQ4_K_R4" && \
./build/bin/llama-perplexity \
--model /mnt/data/Models/anikifoss/DeepSeek-R1-0528-DQ4_K_R4/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf \
-f /mnt/data/Datasets/wiki.test.raw \
--no-mmap \
-ctk f16 \
-mla 3 -fa \
-amb 1024 \
-fmoe \
--seed 1337 \
--ctx-size 512 \
-b 2048 -ub 2048 \
--n-gpu-layers 99 \
--override-tensor exps=CPU,attn_kv_b=CPU \
--parallel 1 \
--threads 32 && \
echo ">>>>>> Q2_K_R4" && \
./build/bin/llama-perplexity \
--model /mnt/data/Models/anikifoss/DeepSeek-R1-0528-Q2_K_R4/DeepSeek-R1-0528-Q2_K_R4.gguf \
-f /mnt/data/Datasets/wiki.test.raw \
--no-mmap \
-ctk f16 \
-mla 3 -fa \
-amb 1024 \
-fmoe \
--seed 1337 \
--ctx-size 512 \
-b 2048 -ub 2048 \
--n-gpu-layers 99 \
--override-tensor exps=CPU,attn_kv_b=CPU \
--parallel 1 \
--threads 32 && \
echo ">>>>>> DQ2_K_R4" && \
./build/bin/llama-perplexity \
--model /mnt/data/Models/anikifoss/DeepSeek-R1-0528-DQ2_K_R4/DeepSeek-R1-0528-DQ2_K_R4.gguf \
-f /mnt/data/Datasets/wiki.test.raw \
--no-mmap \
-ctk f16 \
-mla 3 -fa \
-amb 1024 \
-fmoe \
--seed 1337 \
--ctx-size 512 \
-b 2048 -ub 2048 \
--n-gpu-layers 99 \
--override-tensor exps=CPU,attn_kv_b=CPU \
--parallel 1 \
--threads 32 && \
echo ">>>>>> Q4_K_R4" && \
./build/bin/llama-perplexity \
--model /mnt/data/Models/anikifoss/DeepSeek-R1-0528-Q4_K_R4/DeepSeek-R1-0528-Q4_K_R4.gguf \
-f /mnt/data/Datasets/wiki.test.raw \
--no-mmap \
-ctk f16 \
-mla 3 -fa \
-amb 1024 \
-fmoe \
--seed 1337 \
--ctx-size 512 \
-b 2048 -ub 2048 \
--n-gpu-layers 99 \
--override-tensor exps=CPU,attn_kv_b=CPU \
--parallel 1 \
--threads 32 && \
echo ">>>>>> Q8_0" && \
./build/bin/llama-perplexity \
--model /mnt/data/Models/anikifoss/DeepSeek-R1-0528-Q8_0/DeepSeek-R1-0528-Q8_0.gguf \
-f /mnt/data/Datasets/wiki.test.raw \
--no-mmap \
-ctk f16 \
-mla 3 -fa \
-amb 1024 \
-fmoe \
--seed 1337 \
--ctx-size 512 \
-b 2048 -ub 2048 \
--n-gpu-layers 99 \
--override-tensor exps=CPU,attn_kv_b=CPU \
--parallel 1 \
--threads 32
OK. I'm downloading this quant to compare it with DeepSeek-R1-0528-IQ4_KS_R4. I'll see how both fare on my programming tasks!
here are the perplexity numbers
I think your quant "wins" for the best reported perplexity, at least among those I've seen published with this methodology! Very nice job!
Wow, thanks so much for being thorough and including the commands and everything. Your Q8_0 seems to be very close to what mine was, so our methodologies likely align well enough to compare, but of course this is just wiki.test.raw so I don't want to generalize too much.
I don't know your DQ4_K_R4's exact size in GiB and BPW (I use the numbers printed in the llama-server debug logs, grepping for BPW), but I did a rough estimate using the file sizes reported by huggingface to add yours to this graph:
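For reference, the back-of-the-envelope BPW from file size alone looks something like this; a rough sketch assuming ~671B total parameters for DeepSeek-R1 (the BPW printed in the llama-server logs is the more precise number):

```python
# Rough BPW estimate from total GGUF file size (metadata overhead is negligible).
# Assumes ~671e9 total parameters for DeepSeek-R1-0528.
n_params = 671e9

def bpw(size_gib: float) -> float:
    return size_gib * 1024**3 * 8 / n_params

print(f"DQ4_K_R4  ~{bpw(413.2):.2f} BPW")  # ~5.29 BPW from the ~413.2 GiB above
print(f"IQ4_KS_R4 ~{bpw(368.0):.2f} BPW")  # ~4.71 BPW, close to the 4.701 reported
```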
Not sure if you are interested in doing another one, but if you went with all iq5_ks for down/gate/up it would likely be faster without sacrificing much/any quality, I'm guessing. It would end up a similar size or even possibly bigger (I haven't calculated exactly). The iq4/5_ks quants tend to be faster, pretty sure, although I'm not fully sure on this model given a lot of CPU offload.
No pressure at all, just daydreaming about what other possible options would work in the larger size range you seem to enjoy!
Cheers!
Yeah, -ctk q8_0 usually has a slightly "worse" PPL than full -ctk f16, similar to what your linked data is showing. That matches my experience here, as I ran my baseline pure Q8_0 quant both ways at some point:
- Q8_0 -ctk f16: 3.2119 +/- 0.01697
- Q8_0 -ctk q8_0: 3.2130 +/- 0.01698
It's not too bad though, and I do use q8_0 a lot if I want the extra VRAM for something else like context or offloading another layer.
@BernardH that's awesome! Please keep us posted: what language/domain you applied them to and the results.
I must have a corrupted file: "Oops(ggml_compute_forward_sum_rows_f32, ffn_moe_weights_sum-5): found nan for i1 = 0, i2 = 0, i3 = 0. ne00 = 256" :'(
Would you mind posting hashes (md5sum? whatever) for the 10 files so that I know which one to download again (of course I didn't use xet because it was so slooooow last time I tried).
Thx!
A quick way to get all the checksums is to git-clone the huggingface repo without the LFS plugin. Then each large file is simply a placeholder text file with some metadata, including sha256.
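Roughly like this, as a sketch: GIT_LFS_SKIP_SMUDGE=1 keeps the big files as small pointer stubs, and the repo URL below is just an example.

```python
# Sketch: read sha256 checksums out of git-lfs pointer files without downloading the weights.
# Pointer stubs contain lines like "oid sha256:<hash>".
import os
import re
import subprocess
from pathlib import Path

repo = "https://huggingface.co/anikifoss/DeepSeek-R1-0528-DQ4_K_R4"  # example repo URL
subprocess.run(["git", "clone", repo, "pointer-clone"],
               env={**os.environ, "GIT_LFS_SKIP_SMUDGE": "1"}, check=True)

for stub in sorted(Path("pointer-clone").rglob("*.gguf")):
    match = re.search(r"oid sha256:([0-9a-f]{64})", stub.read_text(errors="ignore"))
    if match:
        print(f"{match.group(1)}  {stub.name}")
```

Then sha256sum your downloaded files and compare against these.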
I also run sha256 manually to make sure the original source matches the repo (it does).
a532a14dffe840f8c6a394417b3109f3f863323230bf76faba8ba1a11de43e79 *DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf
1f1acfdd50e4dfc2b0544bb568d42abc1e6b89fa93932e30e8a0093356e6f930 *DeepSeek-R1-0528-DQ4_K_R4-00002-of-00010.gguf
0a7a7746d28e4ba7a570209e4e311614716b6668480239ba9630b34c8252a7b5 *DeepSeek-R1-0528-DQ4_K_R4-00003-of-00010.gguf
edd9a617b2b123c2c9c21e2a591959c2f888d38df40a126c13a047ff1bef8d8c *DeepSeek-R1-0528-DQ4_K_R4-00004-of-00010.gguf
adeec3be8993aec156841c31a2f15d936b74add28d85fdfdbd3bfaa3c90daf0e *DeepSeek-R1-0528-DQ4_K_R4-00005-of-00010.gguf
888a7b26a560ed35641304cf75e6e6045cc2887b450c18d3fffcf28ba78a95c9 *DeepSeek-R1-0528-DQ4_K_R4-00006-of-00010.gguf
98bbb26be716a8f9df5365bdbc733339c9b6cf982093885362e39c1ca608ef65 *DeepSeek-R1-0528-DQ4_K_R4-00007-of-00010.gguf
b9a3eca73b3b2986e5dc69791aaa473b0659ee71a35370ef3d902622f9fe9396 *DeepSeek-R1-0528-DQ4_K_R4-00008-of-00010.gguf
7f40cfa6dd62f2c45fdee939120705debf15fd9a108ec76b3840a58d4515a5ec *DeepSeek-R1-0528-DQ4_K_R4-00009-of-00010.gguf
5579f17934f1df9c53dcc9870b80ede5d2fc0af9e03a7c394ae322bf7570afac *DeepSeek-R1-0528-DQ4_K_R4-00010-of-00010.gguf
Thx!
Running a sweep bench right now.
I noticed that your example command contains the following sampling parameter values: "--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0". Any reason to have these instead of the values recommended at https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF ("--temp 0.6 --top-p 0.95")?
Those are just my preferences that I found work best for coding. From my experience, min_p is more predictable, since it avoids highly random tokens altogether, so the model stays on track for longer tasks. Take a look at this deep dive into how min_p works. However, this is largely a matter of preference, so what works for you will likely be different.
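For anyone unfamiliar: min_p keeps only tokens whose probability is at least min_p times the top token's probability, so the candidate pool shrinks when the model is confident and widens when it isn't. A tiny numpy sketch of just that filtering step (an illustration of the idea, not the actual llama.cpp sampler code):

```python
# Illustration of min-p filtering: keep tokens with p >= min_p * max(p), then renormalize.
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float = 0.1) -> np.ndarray:
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

probs = np.array([0.50, 0.30, 0.12, 0.05, 0.02, 0.01])
print(min_p_filter(probs, 0.1))  # drops the 0.02 and 0.01 tails (threshold = 0.1 * 0.5 = 0.05)
```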
This ik_llama.cpp PR533 just opened and could boost your quants' PP quite a bit, especially for larger batch sizes! Spread the good word! haha
Thank you so much for releasing this. I've been trying to troubleshoot some performance issues I've been seeing with my system, and these quants are the only ones I've been able to use reliably. I would be curious if anyone has insight into what I am seeing.
Setup:
- ik_llama.cpp (bce7697d64dc09d52dec468b7ed69c768967b8b6)
- Compiled with:
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_BLAS=OFF -DCMAKE_CUDA_ARCHITECTURES="120-real"
- Ubuntu 24.04
- Dual Xeon 8480+
- 1TB RAM (16x64, 2DPC)
- RTX 5090
The command I am running:
$ numactl --interleave=all \
build/bin/llama-sweep-bench \
-m /mnt/data/models/IK_GGUF/anikifoss_DeepSeek-R1-0528-DQ4_K_R4/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf \
--numa distribute \
-ctk q8_0 \
-mla 3 \
-fa \
-amb 512 \
-b 2048 \
-ub 2048 \
-fmoe \
-ngl 99 \
--override-tensor exps=CPU \
-c 65536
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
2048 | 512 | 0 | 10.638 | 192.52 | 49.649 | 10.31 |
2048 | 512 | 2048 | 10.928 | 187.41 | 50.936 | 10.05 |
but if I run the exact same command against @ubergarm's IQ4_KS_R4, prompt processing tanks:
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
2048 | 512 | 0 | 46.112 | 44.41 | 47.026 | 10.89 |
2048 | 512 | 2048 | 46.048 | 44.48 | 48.171 | 10.63 |
the only way I can recover prompt processing performance is by reducing --ubatch-size from 2048 to 512:
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 3.594 | 142.48 | 11.975 | 10.69 |
512 | 128 | 512 | 3.564 | 143.66 | 11.924 | 10.73 |
What I noticed is that the IQ4_KS_R4 will utilize all my cores until a point at which only one core is utilized and the rest are left idle. At that time, the process is busy spending most of its time copying data from CPU to GPU via cudaMemcpyAsync, and nvtop shows an uptick in bandwidth (PCIe GEN 5@16x RX: 8.681 GiB/s TX: 777.1 MiB/s). I wasn't sure if this was expected behavior, but mainline llama.cpp was significantly worse.
From my testing, IQ quants can be generally slower than Q quants, but the increase of S_PP t/s with reduced batch size is strange. Try with -mla 2; it's been out longer and is better tested.
@Kebob interesting, thanks for the benchmarks.
Your commands look reasonable, though here are a few thoughts:
- You might have enough free VRAM left over to offload one or more additional layers onto that GPU for additional TG speeds.
- I assume you have configured the BIOS for SNC=Disabled to present a single NUMA node per socket, or a similar configuration (more NUMA nodes generally reduces performance).
- I don't see --threads nor --threads-batch?? You will want to dial those in for your CPUs by benchmarking. Also consider disabling NUMA balancing and testing with and without that.
Regarding the differences in performance and batch sizes, ik's fork moves fast, and some recent PRs have already changed the performance of various quants across CPU and CUDA backends since we released these quants. Check out PR531 (a set of 3 performance boosters), and the more recent PR557 for some more discussion. The upshot is that PP performance of IQX_K quants has recently become faster than IQX_KS quants, depending on backend and exact mix, last I checked.
Basically you will need to benchmark a few variations like you're doing across a few batch sizes and dial in whatever is working best for your rig. Appreciate you sharing your findings so other folks might see what to expect as well.
Thanks for the quick replies. I'll try to do some more testing.
You might have enough free VRAM left over to offload one or more additional layers onto that GPU for additional TG speeds.
Yeah, I've been trying to balance prompt processing speed and token generation speed. I was playing with the batch size to improve the former, but that's when I noticed the performance regression with the IQ quants on my system.
I assume you have configured BIOS for SNC=Disabled to present a single NUMA node per socket or similar configuration (more NUMA nodes generally reduces performance).
Correct, I have SNC disabled. I also tried using numactl --cpunodebind=0 --membind=0 (corresponding to the PCIe slot of my GPU) to no avail.
I don't see --threads nor --threads-batch?? You will want want to dial that in for your CPUs by benchmarking. Also consider disabling numa balancing and testing with and without that.
My understanding is that if I leave this off, it will default to all cores. llama-sweep-bench shows: main: n_kv_max = 65536, n_batch = 2048, n_ubatch = 2048 flash_attn = 1, n_gpu_layers = 99, n_threads = 112, n_threads_batch = 112, and I have two 56-core processors (HT is disabled). Are you suggesting using fewer cores? I did try disabling half the cores on each processor via BIOS settings, and that seemed to help a bit with TG but hurt PP. It did not seem to make a difference in the issue I've been seeing with increasing --ubatch-size on IQ quants, unfortunately.
You will take a big performance hit if you use too many cores because of CPU cache thrashing. A general rule of thumb for running LLMs on the CPU is to limit threads to the number of physical cores. Some people go as far as disabling SMT in the BIOS.
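On Linux you can sanity-check logical CPUs vs. physical cores with something like this sketch before picking --threads:

```python
# Count logical CPUs vs. physical cores on Linux by parsing /proc/cpuinfo.
# Physical cores = unique (physical id, core id) pairs across logical CPUs.
import os

logical = os.cpu_count()
cores = set()
phys_id = core_id = None
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("physical id"):
            phys_id = line.split(":")[1].strip()
        elif line.startswith("core id"):
            core_id = line.split(":")[1].strip()
        elif line.strip() == "":  # blank line ends one logical CPU's block
            if phys_id is not None and core_id is not None:
                cores.add((phys_id, core_id))
            phys_id = core_id = None
if phys_id is not None and core_id is not None:
    cores.add((phys_id, core_id))

print(f"logical CPUs: {logical}, physical cores: {len(cores)}")
```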
Correct, I have SNC disabled. I also tried using numactl --cpunodebind=0 --membind=0 (corresponding to the PCIe slot of my GPU) to no avail.
Yeah, def try backing off on the cores. In my experience on a dual-socket Intel Xeon 6980P, while PP might benefit from using both sockets, I generally run the model on a single socket to avoid cross-NUMA traffic. I don't bother to disable SMT (hyperthreads) so haven't really tested that. Given these quants will fit in a single NUMA node's worth of RAM (512 GiB), def give that a try with:
numactl -N 0 -m 0 ..... (rest of your command) ..... --numa numactl
Then I'd use --threads 56 --threads-batch 56 and start from there. Definitely test out reducing to --threads-batch 32, for example, as generally TG is RAM-bandwidth limited and not compute limited (except for those new KT trellis quants... different story there)...
Finally, check out and download the Intel Memory Latency Checker (mlc) app and measure performance; you will see how big of a hit you take crossing NUMA domains between RAM/CPU.
#!/usr/bin/env bash
# grab original hugepages value
hp=$(cat /proc/sys/vm/nr_hugepages)
# set it based on mlc documentation
echo 4000 | sudo tee /proc/sys/vm/nr_hugepages
sudo ./Linux/mlc | tee -a output.log
# now restore original value
echo "$hp" | sudo tee /proc/sys/vm/nr_hugepages
So right now I'm pretty happy with the performance I'm getting with the Q4_K_R4 quants, though there is definitely still room for improvement if I want to optimize for PP or TG. From what I can see from testing, I don't think there's anything that will improve both PP and TG together; I would either be trading off one for the other or just increasing one. My main concern right now is the very poor PP for IQ4_KS_R4 that seems to only be plaguing my setup. It doesn't seem like anyone else is having this issue. Perhaps it makes more sense as a discussion in @ubergarm's repository to confirm if anyone else is seeing the same issue, but I'll finish my current thought.
My understanding is that for some operations, the scheduler will copy data from CPU to GPU to perform matrix operations. My current hypothesis is that for Q4_K_R4, this either isn't occurring as frequently or is happening faster, because I don't see a long duration of sustained RX from CPU -> GPU. For IQ4_KS_R4, I am seeing RX from CPU -> GPU of ~9 GiB/s, which seems pretty low, and it seems to be sustained for about a minute per batch. I have confirmed that my card is running at PCIe Gen 5@16x, and I've used nvbandwidth -t host_to_device_memcpy_ce and get ~54 GiB/s copies from CPU -> GPU. https://github.com/ikawrakow/ik_llama.cpp/pull/239 was what led me to this theory, as I have a similar issue with mainline llama.cpp with a batch size >= 32.
aaand... as I typed this all out and reread the GPU offload policy thread, I've found that adding -op 26,0,27,0,29,0 (which disables offloading the MUL_MAT, MUL_MAT_ID, and MOE_FUSED_UP_GATE ops) recovers the PP performance for @ubergarm's IQ4_KS_R4 quant for me, while still allowing me to increase batch/ubatch sizes for further PP performance gains. I still need to do some more digging, because I wonder how much performance I am leaving behind by having these ops disabled, and it still doesn't make sense to me why ik_llama.cpp seems to be CPU->GPU memory-bandwidth copy bound.
Comparing Q4_K_R4 (control) vs IQ4_KS_R4 (treatment) using -op 26,0,27,0,29,0 trades off slightly worse PP but slightly better TG:
PP | TG | N_KV | S_PP t/s | S_TG t/s | <- C - T -> | S_PP t/s | S_TG t/s |
---|---|---|---|---|---|---|---|
2048 | 512 | 0 | 189.80 | 11.15 | 109.2% / 98.0% | 173.78 | 11.38 |
2048 | 512 | 2048 | 185.34 | 10.99 | 109.0% / 96.4% | 169.99 | 11.40 |
2048 | 512 | 4096 | 179.24 | 10.88 | 108.8% / 96.9% | 164.67 | 11.23 |
2048 | 512 | 6144 | 179.96 | 10.81 | 107.9% / 96.8% | 166.74 | 11.17 |
AVG | - | - | 183.59 | 10.96 | 108.8% / 97.0% | 168.80 | 11.29 |
Nice job digging into it deeper, and thanks for sharing how to use the nvbandwidth tool (that is new to me and looks useful).
If you want to get into it deeper, you could reply to one of the discussions that looks appropriate on ik_llama.cpp and link this along with your final chart and observations, perhaps?
Makes sense the KS version would be faster for TG strictly from the smaller active parameters requiring less RAM bandwidth. I've personally never dug into the various -op stuff but see it mentioned in some PRs occasionally.
Did you see any speed-ups going to a single CPU socket with numactl -N 0 -m 0 llama-server --numa numactl <rest of your command>??
If you want to get into it deeper you could reply to one of the discussions that looks appropriate on ik_llama.cpp and link this along with your final chart and observations perhaps?
That's a good idea. I'll post a follow-up on the ik_llama.cpp repo once I finish digging a little more.
Makes sense the KS version would be faster TG strictly from the smaller sized active parameters requiring less RAM bandwidth.
Yeah, that makes sense. I came to that realization as I was testing other quants.
Did you see any speed-ups going to a single CPU socket with numactl -N 0 -m 0 llama-server --numa numactl ??
I do see a modest TG speed-up when I use half the physical cores available (112 -> 56), but I also see a drop in PP which is to be expected:
Comparing Q4_K_R4 (control) vs Q4_K_R4 w/ --threads 56 --threads-batch 56 (treatment):
PP | TG | N_KV | S_PP t/s | S_TG t/s | <- C - T -> | S_PP t/s | S_TG t/s |
---|---|---|---|---|---|---|---|
2048 | 512 | 0 | 182.65 | 11.38 | 121.2% / 93.2% | 150.67 | 12.21 |
2048 | 512 | 2048 | 179.91 | 10.91 | 121.8% / 89.7% | 147.74 | 12.16 |
2048 | 512 | 4096 | 174.95 | 10.89 | 121.3% / 89.7% | 144.23 | 12.14 |
2048 | 512 | 6144 | 175.43 | 10.70 | 121.6% / 89.7% | 144.21 | 11.93 |
AVG | - | - | 178.24 | 10.97 | 121.5% / 90.6% | 146.71 | 12.11 |
Unfortunately, for my system, using numactl -N 0 -m 0 reduces both TG and PP because it only leverages half the memory bandwidth of my system in that case:
Comparing Q4_K_R4 (control) vs Q4_K_R4 w/ numactl -N 0 -m 0 llama-sweep-bench --numa numactl [...] (treatment):
PP | TG | N_KV | S_PP t/s | S_TG t/s | <- C - T -> | S_PP t/s | S_TG t/s |
---|---|---|---|---|---|---|---|
2048 | 512 | 0 | 182.65 | 11.38 | 161.4% / 138.8% | 113.18 | 8.20 |
2048 | 512 | 2048 | 179.91 | 10.91 | 161.0% / 124.8% | 111.77 | 8.74 |
2048 | 512 | 4096 | 174.95 | 10.89 | 159.0% / 124.7% | 110.04 | 8.73 |
2048 | 512 | 6144 | 175.43 | 10.70 | 159.3% / 124.1% | 110.12 | 8.62 |
AVG | - | - | 178.24 | 10.97 | 160.2% / 128.1% | 111.28 | 8.57 |
I don't know how accurate this is, but that makes me think that if I had all 8 channels on a single processor, I would get closer to 16 t/s TG (EDIT: actually, it wouldn't be exactly 2x, since some of the TG is occurring on the GPU).
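That intuition lines up with the usual napkin math for memory-bound TG: tokens/s is roughly usable RAM bandwidth divided by the bytes of weights read per token. A sketch with assumed round numbers (~37B active parameters for DeepSeek-R1, ~4.8 BPW for a 4-bit-class quant, 8 channels of DDR5-4800 per socket), not measurements from this thread:

```python
# Back-of-the-envelope TG ceiling for a memory-bandwidth-bound MoE model:
# tokens/s ~= usable RAM bandwidth / bytes of weights read per generated token.
# All inputs below are rough assumptions, not measured values.
active_params = 37e9        # ~37B activated parameters per token for DeepSeek-R1
bpw = 4.8                   # rough average bits per weight for a ~4-bit quant
bytes_per_token = active_params * bpw / 8   # ~22 GB touched per token

bandwidth = 8 * 38.4e9      # 8 channels DDR5-4800, ~307 GB/s theoretical per socket
print(f"single-socket ceiling ~= {bandwidth / bytes_per_token:.1f} t/s")  # ~13.8 t/s
```

Real numbers land below that ceiling (attention compute, never hitting theoretical bandwidth, and the GPU-offloaded layers shifting the math), which is roughly in line with the 11-12 t/s measured above.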
I do see a modest TG speed-up when I use half the physical cores available (112 -> 56), but I also see a drop in PP which is to be expected:
Comparing Q4_K_R4 (control) vs Q4_K_R4 w/ --threads 56 --threads-batch 56 (treatment):
Yeah, you can mix and match to get the best of both worlds, e.g. --threads 56 --threads-batch 112, which would automatically switch between 56 threads for TG and 112 threads for PP. (Sorry, I think I had it backwards above; I always get this confused and have to refer to llama-server --help lol.)
Unfortunately, for my system, using numactl -N 0 -m 0 reduces both TG and PP because it only leverages half the memory bandwidth of my system in that case:
Huh, interesting. In my previous experience on that dual-socket Xeon 6980P writeup, the cross-NUMA-node penalty made it unable to actually take advantage of the theoretical extra memory bandwidth for TG. PP, yes, can take advantage of both sockets because of the extra CPU compute.
You can measure it with mlc as I mentioned, but it seems like on your specific rig with your BIOS config it is helping, so go for it.
The only other thing I am wondering about is your nvidia driver and having a new 5090 Blackwell GPU. I recently heard some ramblings on reddit about speed regressions on newer drivers for vllm, but I'm not sure it would apply here; just pointing out there are a lot of variables haha...
Anyway, nice job spelunking and measuring and benchmarking! Curious to see how you make out!
Yeah you can mix and match to get the best of both worlds e.g. --threads 56 --threads-batch 112 which would automatically switch between 56 threads for TG and 112 threads for PP. (sorry i think i had it backwards above, always get this confused and have to refer to llama-server --help lol)
Thanks for the tip - it worked like a charm!
Only other thing I am wondering is about your nvidia driver and having a new 5090 Blackwell GPU. Recently heard some ramblings on reddit about speed regressions on newer drivers for vllm but not sure it would apply here, just pointing out there are a lot of variables haha...
Yeah I actually came across that as well and was going to look into it more, but I think with a 5090 I'm limited to a few versions that I can try.
My main concern right now is the very poor PP for IQ4_KS_R4 that seems to only be plaguing my setup. It doesn't seem like anyone else is having this issue. Perhaps it makes more sense as a discussion in @ubergarm 's repository to confirm if anyone else is seeing the same issue, but I'll finish my current thought.
I am seeing the same or similar behavior, @Kebob. Some benchmarks I just did are here.