Ik_llama.cpp speed improvements over llama.cpp

#3
by ciprianv - opened

Thank you for your work. With ik_llama.cpp, this quant, and the -ot settings below, compared with the unsloth UD-Q3_K_XL quant using -ot for the ffn tensors, I get double the prompt processing speed (from 12 to 24 t/s) with the same generation speed of 7.4 t/s. Are there still ways to improve, especially the prompt processing speed? I am using a 3090 24GB, 128GB of 2933 MT/s RAM, and an AMD Threadripper 2950X (16 cores) on Ubuntu. Is having only AVX2 and no newer instruction set the problem for the prompt processing speed?
This is my command and results:
./build/bin/llama-sweep-bench --model ~/ai/models/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K -fa -ctk q8_0 -ctv q8_0 -c 32768 -fmoe -amb 512 -rtr -ot blk.1[2-9].ffn.*=CPU -ot blk.[2-8][0-9].ffn.*=CPU -ot blk.9[0-3].ffn.*=CPU -ngl 99 --threads 16 --host 0.0.0.0 --port 5002

|  PP |  TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----|-----|------|--------|----------|--------|----------|
| 512 | 128 |    0 | 21.289 |    24.05 | 17.568 |     7.29 |
| 512 | 128 |  512 | 21.913 |    23.37 | 17.619 |     7.26 |
| 512 | 128 | 1024 | 21.274 |    24.07 | 18.157 |     7.05 |
| 512 | 128 | 1536 | 20.882 |    24.52 | 18.046 |     7.09 |
| 512 | 128 | 2048 | 20.189 |    25.36 | 17.913 |     7.15 |
| 512 | 128 | 2560 | 21.139 |    24.22 | 18.439 |     6.94 |

@ciprianv

Nice job, glad you're getting some big speed-ups by optimizing your configuration!

> Is having only AVX2 and no newer instruction set the problem for the prompt processing speed?

I'm not 100% sure, but my guess is that the limiting factor is your RAM bandwidth.
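
As a rough sanity check on the token generation side (assuming your 2950X is populated quad-channel and this mix averages roughly 4 bits per weight): 4 channels x 8 bytes x 2933 MT/s is about 94 GB/s of theoretical bandwidth, and ~22B active parameters at ~4 bpw means reading roughly 11 GB of weights per token (minus whatever sits in VRAM). That puts the ceiling somewhere around 8-9 t/s, which is right in line with the ~7 t/s you're seeing, so the CPU side does look bandwidth-bound.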

./build/bin/llama-sweep-bench \
    --model ~/ai/models/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
    --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
    -fa \
    -ctk q8_0 -ctv q8_0 \
    -c 32768 \
    -fmoe \
    -amb 512 \
    -rtr \
    -ot blk.1[2-9].ffn.*=CPU \
    -ot blk.[2-8][0-9].ffn.*=CPU \
    -ot blk.9[0-3].ffn.*=CPU \
    -ngl 99 \
    --threads 16 \
    --host 0.0.0.0 \
    --port 5002

Your command looks pretty good: the first 12 or so layers' ffn tensors stay on the GPU, the rest go to the CPU, and -rtr repacks the CPU-side tensors for max CPU/RAM cache performance.
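
If you find some VRAM headroom, you could also shift the split point and keep a few more layers' ffn tensors on the GPU, e.g. the first 16 instead of 12. This is an untested sketch; you'd have to check that it still fits in 24GB alongside the 32k q8_0 KV cache, and dial it back if it OOMs:

-ot blk.1[6-9].ffn.*=CPU -ot blk.[2-8][0-9].ffn.*=CPU -ot blk.9[0-3].ffn.*=CPU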

Since you have more RAM than my setup, you could possibly make your own IQ4_KS quant, which inferences faster than the types I chose for this mix, though it would use more RAM.
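
If you want to try that, the quantize tool in ik_llama.cpp works much like mainline's. A rough sketch (the file names here are placeholders, and you'd want an imatrix for the low-bit tensors):

./build/bin/llama-quantize \
    --imatrix imatrix-qwen3-235b.dat \
    Qwen3-235B-A22B-BF16.gguf \
    Qwen3-235B-A22B-IQ4_KS.gguf \
    IQ4_KS 16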

Otherwise, some people are trading tips on a Discord called "Beaver AI Club", and I saw a recent reddit post with some ideas here: https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_offload_gguf_layers_offload_tensors_200_gen/

Good luck!

Strange: I more than doubled the prompt processing speed again with ik_llama by removing -rtr and -fmoe. Maybe some optimization is missing for my older CPU?

|  PP |  TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----|-----|------|--------|----------|--------|----------|
| 512 | 128 |    0 |  7.602 |    67.35 | 15.631 |     8.19 |
| 512 | 128 |  512 |  7.614 |    67.24 | 15.908 |     8.05 |
| 512 | 128 | 1024 |  7.575 |    67.59 | 15.904 |     8.05 |

Make sure to pull the most recent updates to ik_llama.cpp for even faster PP!
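
If it's been a while since you built, something like this should pick up the latest changes (assuming a CUDA build; use whatever cmake flags you used originally):

cd ik_llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j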

I'm still surprised that -rtr and -fmoe were slower for you. I know repacked quants can't actually run on the GPU, but -rtr should only repack tensors going onto the CPU:

// ik_llama.cpp/src/llama.cpp line 9310
if (ggml_backend_buffer_is_host(it.second->buffer)) {
    // repack tensor
}
