ik_llama.cpp speed improvements over llama.cpp
Thank you for your work. With ik_llama.cpp, this quant, and these -ot settings, compared with the Unsloth UD Q3 XL quant using -ot for the ffn tensors, I get double the prompt processing speed, from 12 to 24 t/s, and the same generation speed of 7.4 t/s. Are there still ways to improve, especially the prompt processing speed? I am using a 3090 24GB, 128GB of 2933 MT/s RAM, and a 16-core AMD Threadripper 2950X CPU on Ubuntu. Is having only AVX2 and not a newer instruction set the problem for the prompt processing speed?
This is my command and results:
./build/bin/llama-sweep-bench --model ~/ai/models/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K -fa -ctk q8_0 -ctv q8_0 -c 32768 -fmoe -amb 512 -rtr -ot blk.1[2-9].ffn.*=CPU -ot blk.[2-8][0-9].ffn.*=CPU -ot blk.9[0-3].ffn.*=CPU -ngl 99 --threads 16 --host 0.0.0.0 --port 5002
| PP  | TG  | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----|-----|------|--------|----------|--------|----------|
| 512 | 128 |    0 | 21.289 |    24.05 | 17.568 |     7.29 |
| 512 | 128 |  512 | 21.913 |    23.37 | 17.619 |     7.26 |
| 512 | 128 | 1024 | 21.274 |    24.07 | 18.157 |     7.05 |
| 512 | 128 | 1536 | 20.882 |    24.52 | 18.046 |     7.09 |
| 512 | 128 | 2048 | 20.189 |    25.36 | 17.913 |     7.15 |
| 512 | 128 | 2560 | 21.139 |    24.22 | 18.439 |     6.94 |
Nice job, glad you're getting some big speed ups by optimizing your configuration!
> Is having only AVX2 and not a newer instruction set the problem for the prompt processing speed?
I'm not 100% sure, but guessing the limiting factor is your RAM bandwidth.
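For what it's worth, here is a back-of-the-envelope check on that guess for token generation, which is the part that tends to be bandwidth bound. The numbers are assumptions rather than measurements: quad-channel DDR4-2933 on the 2950X, ~22B active parameters per token from the model name, and a rough ~3.5 bits/weight average for the CPU-side expert tensors of this mix:

```cpp
#include <cstdio>

int main() {
    // Assumed: Threadripper 2950X with quad-channel DDR4-2933.
    const double channels        = 4.0;
    const double bytes_per_xfer  = 8.0;      // 64-bit memory channel
    const double transfers_per_s = 2933e6;
    const double peak_gb_s = channels * bytes_per_xfer * transfers_per_s / 1e9;  // ~93.9 GB/s

    // Assumed: ~22B active params per token (A22B) at ~3.5 bits/weight,
    // ignoring the dozen layers whose ffn tensors live on the GPU.
    const double active_params   = 22e9;
    const double bits_per_weight = 3.5;
    const double gb_per_token = active_params * bits_per_weight / 8.0 / 1e9;     // ~9.6 GB

    printf("theoretical peak RAM bandwidth : %.1f GB/s\n", peak_gb_s);
    printf("approx. bytes read per token   : %.1f GB\n", gb_per_token);
    printf("bandwidth-bound TG ceiling     : ~%.1f tok/s\n", peak_gb_s / gb_per_token);
    return 0;
}
```

That ceiling works out to roughly 9-10 tok/s before any real-world losses, so the ~7 tok/s you measure for generation looks about right for that memory system.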
./build/bin/llama-sweep-bench \
--model ~/ai/models/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
--alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
-fa \
-ctk q8_0 -ctv q8_0 \
-c 32768 \
-fmoe \
-amb 512 \
-rtr \
-ot blk.1[2-9].ffn.*=CPU \
-ot blk.[2-8][0-9].ffn.*=CPU \
-ot blk.9[0-3].ffn.*=CPU \
-ngl 99 \
--threads 16 \
--host 0.0.0.0 \
--port 5002
Your command looks pretty good, keeping the first 12 layers' ffn tensors or so on the GPU and putting the rest on the CPU, with -rtr for max CPU/RAM cache performance.
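If you ever want to double-check exactly which layers those -ot regexes catch, here is a quick standalone sketch (not part of ik_llama.cpp; it just replays the three patterns against an example expert tensor name, assuming the model's 94 layers, blk.0 through blk.93):

```cpp
#include <cstdio>
#include <regex>
#include <string>

int main() {
    // The three --override-tensor patterns from the command above.
    const std::regex patterns[] = {
        std::regex("blk\\.1[2-9]\\.ffn.*"),
        std::regex("blk\\.[2-8][0-9]\\.ffn.*"),
        std::regex("blk\\.9[0-3]\\.ffn.*"),
    };

    // One example expert tensor name per layer.
    for (int layer = 0; layer < 94; ++layer) {
        const std::string name = "blk." + std::to_string(layer) + ".ffn_gate_exps.weight";
        bool to_cpu = false;
        for (const auto & re : patterns) {
            if (std::regex_search(name, re)) { to_cpu = true; break; }
        }
        if (!to_cpu) printf("%s stays on the GPU\n", name.c_str());
    }
    return 0;
}
```

It should print blk.0 through blk.11, i.e. the ffn tensors your patterns leave on the 3090 while everything from blk.12 up goes to the CPU.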
Since you have more RAM than my setup, you could possibly make your own IQ4_KS quant, which inferences faster than the quant types I chose for this one, though it would use more RAM.
Otherwise, some people are trading tips on a Discord called "Beaver AI Club", and I saw a recent Reddit post with some ideas here: https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_offload_gguf_layers_offload_tensors_200_gen/
Good luck!
Strange, I more than doubled the prompt processing speed again with ik_llama.cpp by removing -rtr and -fmoe. Maybe there is some missing optimization for my older CPU?
| PP  | TG  | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----|-----|------|--------|----------|--------|----------|
| 512 | 128 |    0 |  7.602 |    67.35 | 15.631 |     8.19 |
| 512 | 128 |  512 |  7.614 |    67.24 | 15.908 |     8.05 |
| 512 | 128 | 1024 |  7.575 |    67.59 | 15.904 |     8.05 |
Make sure to pull the most recent updates to ik_llama.cpp for even faster PP!
I'm still surprised that -rtr and -fmoe were slower for you. I know repacked quants can't actually run on the GPU, but it should only repack tensors going onto the CPU:
// ik_llama.cpp/src/llama.cpp, around line 9310
if (ggml_backend_buffer_is_host(it.second->buffer)) {
    // repack tensor
}
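If I'm reading that right, ggml_backend_buffer_is_host() should only return true for tensors that ended up in system RAM, so the layers you keep on the 3090 are left untouched and only the CPU-side tensors get repacked.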