Question about --ubatch-size and performance

#8 · opened by sousekd

Thank you @ubergarm for the great quants! I honestly don't understand how you managed to keep up over the past couple of weeks with all the model releases.

I realize this might not be the right place to ask, but during my initial testing, I found that increasing --ubatch-size to unusually high values still significantly improves PP t/s. Here's what I measured on IQ5_K with -ub 8192 -ot "blk\.([6-9]|[1-9][0-9])\.ffn_.*_exps=CPU":

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 8192 | 2048 | 0 | 12.158 | 673.79 | 105.836 | 19.35 |
| 8192 | 2048 | 8192 | 13.072 | 626.68 | 107.701 | 19.02 |
| 8192 | 2048 | 16384 | 14.145 | 579.14 | 111.551 | 18.36 |
| 8192 | 2048 | 24576 | 15.052 | 544.23 | 114.911 | 17.82 |

And with -ub 16384 -ot "blk\.([3-9]|[1-9][0-9])\.ffn_.*_exps=CPU":

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 16384 | 4096 | 0 | 18.763 | 873.22 | 213.928 | 19.15 |
| 16384 | 4096 | 16384 | 22.734 | 720.70 | 226.658 | 18.07 |
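
(For reference, tables like these are the kind of output ik_llama.cpp's llama-sweep-bench prints. A minimal sketch of such an invocation, where the model path, offload regex, context size, and thread counts are illustrative placeholders rather than the exact command used here:)

```bash
# Sweep-style benchmark: measures PP/TG speed at increasing KV-cache depths.
# Adjust -c, -ub/-b, the -ot offload regex, and thread counts for your rig.
./build/bin/llama-sweep-bench \
    -m /models/Qwen3-235B-A22B-IQ5_K.gguf \
    -fa -fmoe \
    -c 32768 \
    -ub 16384 -b 16384 \
    -ngl 999 \
    -ot "blk\.([3-9]|[1-9][0-9])\.ffn_.*_exps=CPU" \
    --threads 32 --threads-batch 28
```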

I haven't had much time to test more or really explore the model yet :) but I keep thinking about what's possible with these large models on limited hardware - and how to put things together.

So my question is: does increasing the physical batch size to such unusually high values have any downsides besides increased VRAM usage (which might limit context size, reduce offloading options, or prevent using a higher --parallel setting)?

Many thanks again.

@sousekd

Those PP values indeed look very impressive. I'm not sure of the limits of increasing batch sizes, but I vaguely recall ik mentioning some issues that can occur if you go too high; however, I am unsure which model that was and whether it applies here.

From a practical standpoint, if you want to confirm that you still have numerical stability at higher batch sizes, you could run llama-perplexity like so and confirm that it finishes, gives a final perplexity value, and that no nan values appear.

Adjust for your offload, threads, and increase batch sizes as desired:

$ wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ sha1sum wiki.test.raw
6f1fe2054a940eebfc76b284b09680763b37f5ea  wiki.test.raw

$ numactl -N 1 -m 1 \
./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    --seed 1337 \
    -fa -fmoe \
    --ctx-size 512 \
    -ub 4096 -b 4096 \
    --numa numactl \
    --threads 128 \
    --threads-batch 192 \
    --no-mmap

...

Final estimate: PPL = ....

You can compare against the published values for my quants and confirm you are within tolerance. If that is the case, go for it! If you are still concerned about valid output, do search through the closed PRs on ik_llama.cpp for discussions on -ub 4096 and similar, but sorry, it is a pain to find information sometimes!

Thank you @ubergarm . With:

& '.\build\bin\llama-perplexity.exe' `
  -m $model `
  -f 'wiki.test.raw' `
  --seed 1337 `
  --no-mmap -fa -fmoe `
  -c 512 -amb 512 -b 16384 -ub 16384 `
  -ngl 999  -ot "blk\.([3-9]|[1-9][0-9])\.ffn_.*_exps=CPU" `
  --threads 32 --threads-batch 28 `
  --main-gpu 0

...I got Final estimate: PPL = 4.3169 +/- 0.02551, which is surprisingly quite different from yours (and closer to Q8_0).
With -c 32768 -amb 512 -b 16384 -ub 16384 I got Final estimate: PPL = 3.8585 +/- 0.02117, which is very different.

Can you help me understand those numbers?

(Feel free to respond later; I understand cooking Qwen3-Thinking quants is a much more important and interesting thing to do! :))

@sousekd

Hrmm, lower perplexity is "better" in general. Let's see for the IQ5_K:

- sousekd IQ5_K: 4.3169 +/- 0.02551 (CUDA/CPU backend)
- ubergarm IQ5_K: 4.3351 +/- 0.0256 (CPU-only backend)

Huh, that is a fairly big difference; generally the CPU and CUDA backends are quite similar.

> Can you help me understand those numbers?

So that other number you got, Final estimate: PPL = 3.8585 +/- 0.02117, was for a much longer context (-c 32768), and perplexity is very sensitive to changing that number. I basically always leave it at 512 for benchmarking comparisons. Some folks have suggested using a specialized imatrix corpus and longer context for imatrix and perplexity testing; you can read an almost two-year-old thread on mainline llama.cpp discussing it, and the discussions still go on today.

Just to keep things comparable, I try to use the exact same methodology as shown above. You can increase -ub and -b, but don't mess with the context when doing perplexity, and see what you get!

Also, remove -amb; that only applies to MLA models (DeepSeek/Kimi), and then maybe only for -mla 3, but don't quote me on that last bit. It's not needed for Qwen and doesn't do anything there.
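
For example, a sketch of how I understand those flags fit together (model paths and the -mla value are illustrative, taken from the discussion above, not a recommendation):

```bash
# MLA model (DeepSeek/Kimi): -mla selects the MLA attention path,
# -amb caps the temporary attention compute buffers (the thread above used -amb 512)
./build/bin/llama-server -m /models/DeepSeek-R1-0528-IQ4_K.gguf -fa -fmoe -mla 3 -amb 512

# Qwen3 (no MLA): simply drop -mla/-amb, they have no effect here
./build/bin/llama-server -m /models/Qwen3-235B-A22B-IQ5_K.gguf -fa -fmoe
```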

Hmm, I see.

I retested without -amb 512, and the result with -c 512 was exactly the same: 4.3169 ± 0.02551. Then I tried again with -b 4096 -ub 4096, and the result was 4.3177 ± 0.02551 - a tiny bit worse. Your result for IQ5_K was 4.3351 ± 0.02566, and for Q8_0 it was 4.3139 ± 0.02550.

Now we’re wondering why my IQ5_K tests produce lower perplexity values, closer to Q8_0. But what about that ±0.0255 tolerance? Doesn’t it mean all of these results — including Q8_0 — fall within the margin of error and should be considered effectively the same? :)

If not, I don't understand why my results differ from yours.
If yes, then it means you produced quants with the same perplexity as Q8_0, right?

> I retested without -amb 512, and the result with -c 512 was exactly the same: 4.3169 ± 0.02551.

Okay, that seems good and makes sense given -amb 512 only applies to MLA models.

> Then I tried again with -b 4096 -ub 4096, and the result was 4.3177 ± 0.02551 - a tiny bit worse.

That seems okay. I can't find the reference from ik in a discussion, but he mentioned some rounding error with large batch sizes that is masked when the context is smaller than the batch size, so it should be okay. Yours seems pretty close, about 0.0008 off.

> Now we’re wondering why my IQ5_K tests produce lower perplexity values, closer to Q8_0.

You're comparing your IQ5_K to my Q8_0 value, though, so it's possible your Q8_0 would also come out lower if the CUDA backend gives slightly better perplexity, since the numerical results can differ slightly between backends. I don't know enough to say whether this difference is larger than usual.

> I don't understand why my results differ from yours.

You could try recompiling with -DGGML_CUDA=OFF and waiting a long time for it to do the calculation on the CPU-only path, which could perhaps change it?
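
Something like this to rebuild for the CPU-only path (a minimal sketch; add back whatever other cmake flags you used for your original build of ik_llama.cpp):

```bash
# Reconfigure without CUDA so llama-perplexity runs entirely on the CPU backend
cmake -B build-cpu -DGGML_CUDA=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build build-cpu --config Release -j "$(nproc)"

# Then rerun the exact same perplexity command using ./build-cpu/bin/llama-perplexity
```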

I'm not too concerned about it, but it is interesting how sensitive perplexity measurements can be. I do believe the IQ5_K is quite good and very comparable to the Q8_0, but not the same. Given that the perplexities of these big MoEs behave quite nicely, I haven't bothered with KLD, which could give another dimension for comparing how similarly a quant responds compared to the full Q8_0. But for now, using these perplexities run on the same rig with the same configuration lets me make design choices in the speed-accuracy trade-off, doing that gradient descent to push the Pareto boundary!

Thank you, @ubergarm , for your patience.

After reviewing the wiki.test.raw data and understanding that results are only comparable between runs with identical parameters (and maybe even the same hardware), I see that published perplexity numbers should be taken with a grain of salt - and not assumed to mean more than they do.

I've got some reading to do on quantization and imatrix - otherwise, I'll just keep asking you stupid newbie questions all day. :)

> After reviewing the wiki.test.raw data and understanding that results are only comparable between runs with identical parameters (and maybe even the same hardware), I see that published perplexity numbers should be taken with a grain of salt - and not assumed to mean more than they do.

Yeah, perplexity is great for comparing the same model on the same hardware with the same configuration, to tell in a relative way whether the tensor quantization choices are degrading the model too much.

Beyond that, definitely take it with a grain of salt. It is fun to compare my quants with unsloth's and see that my perplexity values tend to be a bit lower, given the better-quality quantizations available with ik_llama.cpp, but I don't worry too much about it.

> I've got some reading to do on quantization and imatrix - otherwise, I'll just keep asking you stupid newbie questions all day. :)

We're all learning this stuff together, and it changes fast!! Experiment and share what you find and we'll all push that pareto curve down together!

> We're all learning this stuff together, and it changes fast!! Experiment and share what you find and we'll all push that pareto curve down together!

Okay - let me take you up on that and ask one more thing :)

If the iMatrix calibration data assigns slightly more weight to Wikipedia than to other domains, and we then evaluate the perplexity on another slice of Wikipedia, aren't we effectively testing a best‑case scenario while overlooking potential quality losses elsewhere? I mean, wouldn't it be better to measure perplexity across a broader mix of corpora?

> If the iMatrix calibration data assigns slightly more weight to Wikipedia than to other domains, and we then evaluate the perplexity on another slice of Wikipedia, aren't we effectively testing a best‑case scenario while overlooking potential quality losses elsewhere? I mean, wouldn't it be better to measure perplexity across a broader mix of corpora?

Right, this is why I don't use wiki text in my imatrix calibration corpus, knowing that I am using wiki.test.raw (which is kind of the standard across various inference engines and academic papers) for benchmarking. Because yes, in theory at least, I believe you could somewhat "overfit" the importance matrix data to wiki.test.raw to get a "better"-looking perplexity.

That is another reason why comparing across quant cookers' models can be challenging, as unsloth doesn't release their corpus from what I have seen (they do upload the imatrix data file, though).

> I mean, wouldn't it be better to measure perplexity across a broader mix of corpora?

I get what you're saying, but then I'd ask: what are you trying to do or compare with this perplexity number? I do keep a "secret novel" corpus for KLD comparisons, to do my best to make sure no one has used that text for training or imatrix corpus purposes. It can give different results. I have some more written up here: https://www.reddit.com/r/LocalLLaMA/comments/1khwxal/the_great_quant_wars_of_2025/
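
For the curious, a rough sketch of how such a KLD comparison is typically run with llama-perplexity (this is the workflow as I understand it; file names here are placeholders):

```bash
# 1) Save baseline logits from the Q8_0 (reference) model over a held-out text file
./build/bin/llama-perplexity -m model-Q8_0.gguf -f secret-novel.txt \
    --kl-divergence-base base-logits.dat

# 2) Score the smaller quant against those saved logits
#    (reports KL-divergence, top-token agreement, etc.)
./build/bin/llama-perplexity -m model-IQ5_K.gguf \
    --kl-divergence-base base-logits.dat --kl-divergence
```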

Also at some point these measurements take quite a lot of time and compute so have to think about how we want to use them ahead of time. Generally speaking I want my imatrix corpus to:

  1. be short enough it doesn't take too long to run, e.g. under 2MiB
  2. be diverse enough to exercise most/all the experts in big MoEs (this is hard with Qwen3-235B by the way)
  3. be diverse enough to not "over-fit" just english or just code etc
  4. not contain wiki.test.raw or too similar text because I use that for my benchmarking
  5. have some blocks of text larger than the 512 token context window so it isn't too choppy, but this is just my superstition lol

There are a number of conversations about this stuff floating around old PRs too. It is pretty fascinating, but personally I'm not convinced optimizing the imatrix will give huge gains in model quality. I prefer and enjoy spending more time remixing the quant types for different tensors, trying to balance speed and perplexity, so the perplexity values I do measure are sufficient for that internal, relative comparison.

Thank you for more insights, @ubergarm .

The reason I'm so curious about this is that I'm not entirely sold on the iMatrix concept. My gut feeling is that rare or niche knowledge can be lost disproportionately during quantization that relies on mainstream calibration data. My primary language isn't in the top 25 - possibly not even the top 50 - and at work I need these models to process confidential data in rather obscure domains.

That's why I'd like to understand this better. It might be worth creating my own perplexity test data-set or, even better, my own iMatrix - but I have a lot to learn before I can do that.

> I prefer and enjoy spending more time remixing the quant types for different tensors, trying to balance speed and perplexity, so the perplexity values I do measure are sufficient for that internal, relative comparison.

And you do an amazing job! I just tested your Kimi‑K2 IQ4_KS quants and achieved a very nice 365 PP t/s on my system with a 14K batch size. I wonder why some quants (like yours) benefit so much from larger batches, while others I tested ( @anikifoss 's, @bartowski 's) do not.

> That's why I'd like to understand this better. It might be worth creating my own perplexity test data-set or, even better, my own iMatrix - but I have a lot to learn before I can do that.

Sure, you can look inside my imatrix corpus to see if the language in question is in there or not: https://gist.github.com/ubergarm/edfeb3ff9c6ec8b49e88cdf627b0711a (likely download the raw file and maybe grep it?)
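
Something along these lines, for example (the /raw URL form is the usual gist pattern, so double-check it resolves; the search string is just a placeholder):

```bash
# Fetch the raw corpus text from the gist and search it for a given language/term
curl -L "https://gist.github.com/ubergarm/edfeb3ff9c6ec8b49e88cdf627b0711a/raw" -o imatrix-corpus.txt
grep -i -c "your search term here" imatrix-corpus.txt   # count matching lines
```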

> I wonder why some quants (like yours) benefit so much from larger batches

Hrmm, interesting question... I think the best PP performance comes from quants that use the q8_k_r8 mulmat code path, and I try to stick with those in my quants. You can see a list of those quant types here: https://github.com/ikawrakow/ik_llama.cpp/pull/610#issuecomment-3105282864 as well as a link to A-tier / B-tier / C-tier quants in terms of speed. Many of the mainline-only quant types, e.g. q6_K, use the "B-tier" Q8_0_R8 code path, and Q4_K/Q5_K seem to use the "C-tier" Q8_1 path. This is strictly about PP speed, not quality or TG.
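
If you want to check which quant types a particular GGUF actually uses per tensor (to guess which code path it will hit), the gguf-dump helper from the gguf Python package can list them; a sketch, with the file name as a placeholder:

```bash
pip install gguf                               # provides the gguf-dump script
gguf-dump /models/some-quant.gguf | less       # lists each tensor with its shape and quant type
```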

Also, avoid using -rtr (and thus the _r4 quants) now too, given that the non-repacked versions can more recently benefit from large -ub 4096 -b 4096. I think there is still a place for -rtr if you're using a low batch size to avoid OOMing VRAM, or prioritizing offloading more layers onto VRAM for more TG speed, so both options are available.
