ik_llama.cpp speed improvements over llama.cpp
Thank you for your work. With ik_llama.cpp, this quant, and these -ot settings, compared with the Unsloth UD Q3 XL quant using -ot for the ffn tensors, I get double the prompt processing speed, from 12 to 24 t/s, and the same generation speed of 7.4 t/s. Are there still ways to improve, especially the prompt processing speed? I am using a 3090 24GB, 128GB of 2933 MT/s RAM, and a 16-core AMD Threadripper 2950X CPU on Ubuntu. Is having only AVX2 and not a newer instruction set the problem for the prompt processing speed?
This is my command and results:
./build/bin/llama-sweep-bench --model ~/ai/models/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K -fa -ctk q8_0 -ctv q8_0 -c 32768 -fmoe -amb 512 -rtr -ot blk.1[2-9].ffn.*=CPU -ot blk.[2-8][0-9].ffn.*=CPU -ot blk.9[0-3].ffn.*=CPU -ngl 99 --threads 16 --host 0.0.0.0 --port 5002
| PP  | TG  | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----|-----|------|--------|----------|--------|----------|
| 512 | 128 |    0 | 21.289 |    24.05 | 17.568 |     7.29 |
| 512 | 128 |  512 | 21.913 |    23.37 | 17.619 |     7.26 |
| 512 | 128 | 1024 | 21.274 |    24.07 | 18.157 |     7.05 |
| 512 | 128 | 1536 | 20.882 |    24.52 | 18.046 |     7.09 |
| 512 | 128 | 2048 | 20.189 |    25.36 | 17.913 |     7.15 |
| 512 | 128 | 2560 | 21.139 |    24.22 | 18.439 |     6.94 |
Nice job, glad you're getting some big speed ups by optimizing your configuration!
> Is having only AVX2 and not a newer instruction set the problem for the prompt processing speed?
I'm not 100% sure, but guessing the limiting factor is your RAM bandwidth.
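For what it's worth, here is a back-of-the-envelope check on that guess for token generation, which is the part that tends to be bandwidth bound. The numbers are assumptions rather than measurements: quad-channel DDR4-2933 on the 2950X, ~22B active parameters per token from the model name, and a rough ~3.5 bits/weight average for the CPU-side expert tensors of this mix:

```cpp
#include <cstdio>

int main() {
    // Assumed: Threadripper 2950X with quad-channel DDR4-2933.
    const double channels        = 4.0;
    const double bytes_per_xfer  = 8.0;      // 64-bit memory channel
    const double transfers_per_s = 2933e6;
    const double peak_gb_s = channels * bytes_per_xfer * transfers_per_s / 1e9;  // ~93.9 GB/s

    // Assumed: ~22B active params per token (A22B) at ~3.5 bits/weight,
    // ignoring the dozen layers whose ffn tensors live on the GPU.
    const double active_params   = 22e9;
    const double bits_per_weight = 3.5;
    const double gb_per_token = active_params * bits_per_weight / 8.0 / 1e9;     // ~9.6 GB

    printf("theoretical peak RAM bandwidth : %.1f GB/s\n", peak_gb_s);
    printf("approx. bytes read per token   : %.1f GB\n", gb_per_token);
    printf("bandwidth-bound TG ceiling     : ~%.1f tok/s\n", peak_gb_s / gb_per_token);
    return 0;
}
```

That ceiling works out to roughly 9-10 tok/s before any real-world losses, so the ~7 tok/s you measure for generation looks about right for that memory system.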
./build/bin/llama-sweep-bench \
--model ~/ai/models/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
--alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
-fa \
-ctk q8_0 -ctv q8_0 \
-c 32768 \
-fmoe \
-amb 512 \
-rtr \
-ot blk.1[2-9].ffn.*=CPU \
-ot blk.[2-8][0-9].ffn.*=CPU \
-ot blk.9[0-3].ffn.*=CPU \
-ngl 99 \
--threads 16 \
--host 0.0.0.0 \
--port 5002
Your command looks pretty good, keeping the first 12 layers' ffn tensors or so on the GPU and putting the rest on the CPU, with -rtr for max CPU/RAM cache performance.
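If you ever want to double-check exactly which layers those -ot regexes catch, here is a quick standalone sketch (not part of ik_llama.cpp; it just replays the three patterns against an example expert tensor name, assuming the model's 94 layers, blk.0 through blk.93):

```cpp
#include <cstdio>
#include <regex>
#include <string>

int main() {
    // The three --override-tensor patterns from the command above.
    const std::regex patterns[] = {
        std::regex("blk\\.1[2-9]\\.ffn.*"),
        std::regex("blk\\.[2-8][0-9]\\.ffn.*"),
        std::regex("blk\\.9[0-3]\\.ffn.*"),
    };

    // One example expert tensor name per layer.
    for (int layer = 0; layer < 94; ++layer) {
        const std::string name = "blk." + std::to_string(layer) + ".ffn_gate_exps.weight";
        bool to_cpu = false;
        for (const auto & re : patterns) {
            if (std::regex_search(name, re)) { to_cpu = true; break; }
        }
        if (!to_cpu) printf("%s stays on the GPU\n", name.c_str());
    }
    return 0;
}
```

It should print blk.0 through blk.11, i.e. the ffn tensors your patterns leave on the 3090 while everything from blk.12 up goes to the CPU.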
Since you have more RAM than my setup, you could possibly make your own IQ4_KS quant, which inferences faster than the quant types I chose for this one, though it would use more RAM.
Otherwise, some people are trading tips on a Discord called "Beaver AI Club", and I saw a recent Reddit post with some ideas here: https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_offload_gguf_layers_offload_tensors_200_gen/
Good luck!
Strange, I more than doubled the prompt processing speed again with ik_llama.cpp by removing -rtr and -fmoe. Maybe there is some missing optimization for my older CPU?
| PP  | TG  | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----|-----|------|--------|----------|--------|----------|
| 512 | 128 |    0 |  7.602 |    67.35 | 15.631 |     8.19 |
| 512 | 128 |  512 |  7.614 |    67.24 | 15.908 |     8.05 |
| 512 | 128 | 1024 |  7.575 |    67.59 | 15.904 |     8.05 |
Make sure to pull the most recent updates to ik_llama.cpp for even faster PP!
I'm still surprised that -rtr and -fmoe were slower for you. I know repacked quants can't actually run on the GPU, but it should only repack tensors going onto the CPU:
// ik_llama.cpp/src/llama.cpp, around line 9310
if (ggml_backend_buffer_is_host(it.second->buffer)) {
    // repack tensor
}
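If I'm reading that right, ggml_backend_buffer_is_host() should only return true for tensors that ended up in system RAM, so the layers you keep on the 3090 are left untouched and only the CPU-side tensors get repacked.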