Recommendation for 256 GB RAM / 48 GB VRAM
Hi, thank you for your work! I am happy to be able to run your version of DeepSeek V3 Q2 at better-than-expected and quite usable speed and response quality:
```bash
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-sweep-bench \
    --model /home/ciprian/ai/models/DeepSeek-V3-0324-IQ2_K_R4/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ2_K_R4 \
    --ctx-size 35328 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --temp 0.3 \
    --min-p 0.05 \
    --n-gpu-layers 63 \
    -ot "blk.[0-5].ffn_up_exps=CUDA0,blk.[0-5].ffn_gate_exps=CUDA0" \
    -ot "blk.1[0-3].ffn_up_exps=CUDA1,blk.1[0-3].ffn_gate_exps=CUDA1" \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16 \
    --host 0.0.0.0 --port 5002 \
    --ubatch-size 5888 --batch-size 5888 --no-mmap
```
main: n_kv_max = 35328, n_batch = 5888, n_ubatch = 5888, flash_attn = 1, n_gpu_layers = 63, n_threads = 16, n_threads_batch = 16
|   PP |   TG |  N_KV | T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 5888 | 1472 |     0 | 32.179 |   182.98 | 183.204 |     8.03 |
| 5888 | 1472 |  5888 | 37.996 |   154.96 | 191.340 |     7.69 |
| 5888 | 1472 | 11776 | 43.817 |   134.38 | 198.617 |     7.41 |
| 5888 | 1472 | 17664 | 50.121 |   117.47 | 207.122 |     7.11 |
| 5888 | 1472 | 23552 | 57.124 |   103.07 | 215.082 |     6.84 |
| 5888 | 1472 | 29440 | 60.658 |    97.07 | 222.269 |     6.62 |
My question is: what would you recommend as quantization and command between the Q2 and Q3? Will the Q3 be worth the quality increase over the Q2, considering the speed decrease and this being a thinking model? I can add a 20 GB A4500 alongside the 2x 3090s, but I don't know if it would make a difference.
As CPU, I have a Threadripper 3955WX with 256 GB of DDR4.
And also, can I squeeze something more out of the build params? `cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 GGML_NATIVE=1`
Also, if I want to use more than 32k context, to have room for thinking, is it enough to increase --ctx-size, or should I add some yarn params (same question for the V3)?
Thank you again for your work! You are really helpful!
> Hi, thank you for your work! I am happy to be able to run your version of DeepSeek V3 Q2 at better-than-expected and quite usable speed and response quality
Hey, thanks for the llama-sweep-bench report! Very cool that you're getting usable numbers with a quality quant! I have a few questions about your command, and you might be able to get a little more out of it; for reference, I get about 15 tok/sec with that quant on a newer 24-core 7965WX with DDR5-4800.
- Why are you offloading only `ffn_(gate|up)` to GPU and not including the `down`? I believe `-fmoe` fuses some of those, and you might be better off offloading full layers but fewer of them. You'll have to experiment to confirm, though (rough sketch after this list).
- How did you come to the unique number `5888` for batch/ubatch? I haven't played around with those much, but some have reported improved speeds with higher batch numbers on multi-GPU setups. I assume you discovered that value empirically, but I'm just curious.
- This is a long shot, but possibly try `--threads 16 --threads-batch 24` (assuming SMT is enabled) just to see if you can eke out a little more PP. Probably not, but it's something I'm curious about; it works on giant Intel Xeons but probably not on smaller 16-core systems.
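To illustrate what I mean by full layers, here is a rough, untested sketch of just the `-ot` flags; the block ranges and the assumption that the `down` experts still fit alongside `up`/`gate` are guesses you would have to tune against your actual VRAM headroom:

```bash
# Sketch only: offload the complete expert FFN (up/gate/down) for a few early
# layers per GPU, instead of up/gate across more layers. Widen or narrow the
# block ranges until each 3090 is as full as it can safely get.
-ot "blk\.[0-3]\.ffn_(up|gate|down)_exps=CUDA0" \
-ot "blk\.[4-7]\.ffn_(up|gate|down)_exps=CUDA1" \
--override-tensor exps=CPU
```

The ordering matters the same way it does in your command: the CUDA overrides have to come before the catch-all `exps=CPU` so they match first.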
> Will the Q3 be worth the quality increase over the Q2?
The best quant is the quant you actually use. I personally would probably choose speed, given this is a thinking model. But if you find it is not able to do the tasks you want, only then consider upgrading to a bigger model at the cost of speed. It's all trade-offs, yeah...
> Can I squeeze something more out of the build params?
You're doing pretty well already. I honestly avoid most build params, as they never seem to help and can possibly hurt. I keep it simple and use only what you show. I don't even use `GGML_NATIVE`; I'm not sure what it does or whether it is already the default, haha.
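For reference, the simple build I'd stick with looks roughly like this, i.e. just the flags you already show (side note: if you do want `GGML_NATIVE`, it needs the `-D` prefix, i.e. `-DGGML_NATIVE=ON`, to actually reach CMake):

```bash
# Minimal CUDA build; extra toggles rarely help and can sometimes hurt.
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j "$(nproc)"
```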
> Also, if I want to use more than 32k context, to have room for thinking, is it enough to increase --ctx-size, or should I add some yarn params?
This model can go to 160k context natively; just increase `--ctx-size` (e.g. `--ctx-size 65536`) and it will probably still fit in under 24 GB VRAM, as MLA is very efficient and its KV cache grows much more slowly with context than normal GQA etc. I personally always avoid yarn. I know there are some "128k" GGUFs for Qwen3-30B-A3B, but unless you know for sure that all your prompts will be in that ~100k range, you will actually get degraded performance on shorter prompts. Even the Qwen model card says this, which is why they ship with yarn off by default; they even say to only use 2x yarn if you need ~64k prompt lengths. Everything is trade-offs, and you should only use yarn if you know you really need it. Check out the benchmark I did where both of the unsloth 128k (4x yarn default) GGUFs are noticeably worse in 2k-context testing.
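Concretely (a sketch, not something I've run on your exact box; 65536 is just an example value to double your current window), the only thing that needs to change in your launch command is the context flag, with the rest of the attention-related flags staying as you have them:

```bash
# Same launch as before; only the context/KV-cache-related flags shown here.
--ctx-size 65536 -ctk q8_0 -mla 3 -fa -amb 512
```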
There is no such thing as a free lunch in the LLM world, lmao... A lot of the time you can get smarter about your prompts and keep everything under 64k pretty easily, unless you are doing very specific things... Even then I'd consider RAG or other ways to keep your prompts shorter and sweeter, as performance at very long context just isn't that good in my own experience. YMMV.
Cheers!
Thank you for your answers!
- Tested, and you are right: it increased the TG speed by about 2-3%.
- By testing..
- I have SMT on, but it made no difference.
Just FYI, `-DGGML_CUDA_IQK_FORCE_BF16=1` gave me a 20% increase in PP speed on your DeepSeek V3 IQ2 quant 😀
|   PP |   TG | N_KV | T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|-----:|-----:|-----:|-------:|---------:|--------:|---------:|
| 5120 | 1280 |    0 | 23.645 |   216.53 | 158.836 |     8.06 |
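For anyone wanting to try the same thing, I just pass it at configure time; roughly like this (sketch from memory, other flags from my build omitted):

```bash
# Reconfigure with the BF16 path forced for the CUDA iqk kernels, then rebuild.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_IQK_FORCE_BF16=1
cmake --build build --config Release -j "$(nproc)"
```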
BUT, your DeepSeek-R1-0528 IQ2 works only without `-DGGML_CUDA_IQK_FORCE_BF16=1`; if I add it, it throws:

```
/home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:1286: GGML_ASSERT(to_bf16_cuda != nullptr) failed
Could not attach to process. If your uid matches the uid of the target
```
Thanks for the feedback! You must have done a lot of testing to arrive at that batch number; I'd never have thought to try one that size! I'll have to dig into that some more one day to optimize my own setup.
> BUT, your DeepSeek-R1-0528 IQ2 works only without `-DGGML_CUDA_IQK_FORCE_BF16=1`; if I add it, it throws:
> `/home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:1286: GGML_ASSERT(to_bf16_cuda != nullptr) failed`
> `Could not attach to process. If your uid matches the uid of the target`
Ahh, I saw that error for the first time last night myself. I recompiled and it seemed to go away, so I'm not 100% sure what is going on there. If I can find a way to repeat it, I'll open an issue on ik_llama.cpp. It's also interesting that BF16 is faster than FP16 for you and that you didn't hit any issues, as the recent PR #461 discusses this.
EDIT: Also, some interesting notes that might help you optimize performance further when offloading more layers to VRAM: https://github.com/ikawrakow/ik_llama.cpp/issues/474#issuecomment-2924248648
I just did some speed benchmarks and am also seeing some boost to PP as you describe: https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13335019