Qwen3 Coder test
W790E Sage + QYFS + 512 GB + RTX 5090
IQ5_K:
llm_load_tensors: offloading 62 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 63/63 layers to GPU
llm_load_tensors: CPU buffer size = 41600.58 MiB
llm_load_tensors: CPU buffer size = 43386.38 MiB
llm_load_tensors: CPU buffer size = 43386.38 MiB
llm_load_tensors: CPU buffer size = 43386.38 MiB
llm_load_tensors: CPU buffer size = 43386.38 MiB
llm_load_tensors: CPU buffer size = 43386.38 MiB
llm_load_tensors: CPU buffer size = 43386.38 MiB
llm_load_tensors: CPU buffer size = 34927.03 MiB
llm_load_tensors: CPU buffer size = 737.24 MiB
llm_load_tensors: CUDA0 buffer size = 9156.73 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 98304
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 12648.03 MiB
llama_new_context_with_model: KV self size = 12648.00 MiB, K (q8_0): 6324.00 MiB, V (q8_0): 6324.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 6837.52 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1632.05 MiB
llama_new_context_with_model: graph nodes = 2424
llama_new_context_with_model: graph splits = 126
main: n_kv_max = 98304, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
4096 | 1024 | 0 | 36.147 | 113.32 | 91.782 | 11.16 |
4096 | 1024 | 4096 | 36.396 | 112.54 | 96.775 | 10.58 |
4096 | 1024 | 8192 | 38.433 | 106.58 | 111.770 | 9.16 |
4096 | 1024 | 12288 | 38.772 | 105.64 | 119.940 | 8.54 |
4096 | 1024 | 16384 | 39.167 | 104.58 | 123.763 | 8.27 |
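For anyone sizing host RAM for this setup: the load log above prints one CPU buffer per weight split, and summing them gives the host-side footprint of the expert tensors (about 330 GiB here). A quick sketch, assuming the console output was saved to a file; load.log is a placeholder name:

$ grep 'CPU buffer size' load.log \
    | awk '{ sum += $(NF-1) } END { printf "%.1f GiB of weights on host\n", sum / 1024 }'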
We have a similar setup. I'm using an MS73-HB1 with 2xQYFS and a 5090. Here are my results:
$ numactl --interleave=all build/bin/llama-sweep-bench \
-m /mnt/data2/models/IK_GGUF/ubergarm_Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ5_K/Qwen3-480B-A35B-Instruct-IQ5_K-00001-of-00008.gguf \
-ngl 99 \
-ot "blk.*\.ffn.*=CPU" \
-t 56 \
--threads-batch 112 \
-ub 4096 \
-b 4096 \
--numa distribute \
-fa \
-fmoe \
-ctk q8_0 \
-ctv q8_0 \
-c 98304 \
-op 27,0,29,0
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
4096 | 1024 | 0 | 16.767 | 244.29 | 98.516 | 10.39 |
4096 | 1024 | 4096 | 16.775 | 244.17 | 102.692 | 9.97 |
4096 | 1024 | 8192 | 17.442 | 234.84 | 105.049 | 9.75 |
4096 | 1024 | 12288 | 17.684 | 231.63 | 108.202 | 9.46 |
4096 | 1024 | 16384 | 18.003 | 227.52 | 109.467 | 9.35 |
You may want to try adding -op 27,0,29,0 to see if it helps PP. I believe my TG is lower due to NUMA overhead.
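If you want to sanity-check the NUMA layout behind those thread settings before experimenting, numactl and lscpu ship standard queries for it:

$ numactl --hardware              # nodes, per-node CPUs, per-node memory
$ lscpu | grep -E 'Socket|NUMA'   # socket count and NUMA node summary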
With -op 27,0,29,0 I get better prompt processing. Thanks for your help!
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
main: n_kv_max = 98304, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
4096 | 1024 | 0 | 27.414 | 149.41 | 91.496 | 11.19 |
4096 | 1024 | 4096 | 26.671 | 153.57 | 95.829 | 10.69 |
4096 | 1024 | 8192 | 26.742 | 153.17 | 106.375 | 9.63 |
4096 | 1024 | 12288 | 27.338 | 149.83 | 116.544 | 8.79 |
4096 | 1024 | 16384 | 30.521 | 134.20 | 123.362 | 8.30 |
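That is roughly a 1.3x PP gain at zero depth (149.41 vs. 113.32 t/s) with TG essentially unchanged. If you want the per-row comparison without eyeballing it, here is a quick sketch; it assumes the baseline and -op tables were saved as base.md and op.md (placeholder names) with their trailing pipes intact as printed above:

$ paste -d'|' base.md op.md \
    | awk -F'|' 'NR > 2 { printf "N_KV=%d  S_PP %.2f -> %.2f t/s (%.2fx)\n", $3, $5, $13, $13 / $5 }'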
my full options:
-fa
-ctk q8_0 -ctv q8_0
-c 98304
-b 4096 -ub 4096
-fmoe
-amb 512
--override-tensor exps=CPU
-ngl 99
--threads 101
-op 27,0,29,0
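Assembled into a single invocation for copy-paste (a sketch, not verified end to end: the model path is a placeholder, and the binary name assumes the same llama-sweep-bench build as above):

$ build/bin/llama-sweep-bench \
    -m /path/to/Qwen3-480B-A35B-Instruct-IQ5_K-00001-of-00008.gguf \
    -ngl 99 \
    --override-tensor exps=CPU \
    --threads 101 \
    -b 4096 -ub 4096 \
    -c 98304 \
    -fa -fmoe \
    -amb 512 \
    -ctk q8_0 -ctv q8_0 \
    -op 27,0,29,0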
I'm coding locally with this model in RooCode; it absolutely performs like a thinking model.
Hey guys, why is Q2_K_S not doing well while browsing? It just gets stuck partway through every time.