Qwen3 Coder test

#5 opened by shewin

W790E Sage + QYFS + 512 GB RAM + RTX 5090

IQ5_K:

llm_load_tensors: offloading 62 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 63/63 layers to GPU
llm_load_tensors: CPU buffer size = 41600.58 MiB
llm_load_tensors: CPU buffer size = 43386.38 MiB
llm_load_tensors: CPU buffer size = 43386.38 MiB
llm_load_tensors: CPU buffer size = 43386.38 MiB
llm_load_tensors: CPU buffer size = 43386.38 MiB
llm_load_tensors: CPU buffer size = 43386.38 MiB
llm_load_tensors: CPU buffer size = 43386.38 MiB
llm_load_tensors: CPU buffer size = 34927.03 MiB
llm_load_tensors: CPU buffer size = 737.24 MiB
llm_load_tensors: CUDA0 buffer size = 9156.73 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 98304
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 12648.03 MiB
llama_new_context_with_model: KV self size = 12648.00 MiB, K (q8_0): 6324.00 MiB, V (q8_0): 6324.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 6837.52 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1632.05 MiB
llama_new_context_with_model: graph nodes = 2424
llama_new_context_with_model: graph splits = 126

main: n_kv_max = 98304, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV |  T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|------|------|-------|---------|----------|---------|----------|
| 4096 | 1024 |     0 |  36.147 |   113.32 |  91.782 |    11.16 |
| 4096 | 1024 |  4096 |  36.396 |   112.54 |  96.775 |    10.58 |
| 4096 | 1024 |  8192 |  38.433 |   106.58 | 111.770 |     9.16 |
| 4096 | 1024 | 12288 |  38.772 |   105.64 | 119.940 |     8.54 |
| 4096 | 1024 | 16384 |  39.167 |   104.58 | 123.763 |     8.27 |
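
As a sanity check, the KV self size in the log above ("K (q8_0): 6324.00 MiB") is consistent with a quick back-of-the-envelope calculation, assuming 1024 KV values per token per layer (e.g. 8 KV heads × 128 head dim, which the log does not print) and q8_0 storage at 34 bytes per block of 32 values:

# Rough check of the K (or V) cache size in MiB: 98304-token context,
# 62 repeating layers (from the log), 1024 assumed KV values per token
# per layer, q8_0 at 34 bytes per 32 values.
echo $(( 98304 * 62 * 1024 / 32 * 34 / 1048576 ))   # prints 6324, matching "K (q8_0): 6324.00 MiB"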

[Screenshot: 2025-07-24_14-51.png]

We have a similar setup. I'm using an MS73-HB1 with 2xQYFS and a 5090. Here are my results:

$ numactl --interleave=all build/bin/llama-sweep-bench \
  -m /mnt/data2/models/IK_GGUF/ubergarm_Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ5_K/Qwen3-480B-A35B-Instruct-IQ5_K-00001-of-00008.gguf \
  -ngl 99 \
  -ot "blk.*\.ffn.*=CPU" \
  -t 56 \
  --threads-batch 112 \
  -ub 4096 \
  -b 4096 \
  --numa distribute \
  -fa \
  -fmoe \
  -ctk q8_0 \
  -ctv q8_0 \
  -c 98304 \
  -op 27,0,29,0
|   PP |   TG |  N_KV |  T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|------|------|-------|---------|----------|---------|----------|
| 4096 | 1024 |     0 |  16.767 |   244.29 |  98.516 |    10.39 |
| 4096 | 1024 |  4096 |  16.775 |   244.17 | 102.692 |     9.97 |
| 4096 | 1024 |  8192 |  17.442 |   234.84 | 105.049 |     9.75 |
| 4096 | 1024 | 12288 |  17.684 |   231.63 | 108.202 |     9.46 |
| 4096 | 1024 | 16384 |  18.003 |   227.52 | 109.467 |     9.35 |

You may want to try adding -op 27,0,29,0 to see if it helps PP. I believe my TG is lower due to NUMA overhead.
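
For reference, here is a minimal sketch of what that flag does, based on the offload-policy log printed further down in this thread (the abbreviated option set and the model path are placeholders, not the exact command from this post):

# -op / --offload-policy takes op,0|1 pairs. Judging by the log below,
# op 27 is MUL_MAT_ID and op 29 is MOE_FUSED_UP_GATE; the trailing 0
# turns GPU offload OFF for those ops, so the MoE mat-muls run on the
# CPU next to the expert tensors that -ot already pinned there.
build/bin/llama-sweep-bench \
  -m /path/to/Qwen3-480B-A35B-Instruct-IQ5_K-00001-of-00008.gguf \
  -ngl 99 -ot "blk.*\.ffn.*=CPU" -fa -fmoe -c 98304 -b 4096 -ub 4096 \
  -op 27,0,29,0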

With -op 27,0,29,0, prompt processing is indeed better.

Thanks for your help.

XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF

main: n_kv_max = 98304, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV |  T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|------|------|-------|---------|----------|---------|----------|
| 4096 | 1024 |     0 |  27.414 |   149.41 |  91.496 |    11.19 |
| 4096 | 1024 |  4096 |  26.671 |   153.57 |  95.829 |    10.69 |
| 4096 | 1024 |  8192 |  26.742 |   153.17 | 106.375 |     9.63 |
| 4096 | 1024 | 12288 |  27.338 |   149.83 | 116.544 |     8.79 |
| 4096 | 1024 | 16384 |  30.521 |   134.20 | 123.362 |     8.30 |

My full options (assembled into a single command sketch below):
-fa
-ctk q8_0 -ctv q8_0
-c 98304
-b 4096 -ub 4096
-fmoe
-amb 512
--override-tensor exps=CPU
-ngl 99
--threads 101
--flash-attn
-op 27,0,29,0
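
Put together, the options above correspond to an invocation roughly like the following (the binary name and model path are my assumptions, not taken from the post; -fa and --flash-attn are the same switch, so it appears once here):

./llama-sweep-bench \
  -m /path/to/Qwen3-480B-A35B-Instruct-IQ5_K-00001-of-00008.gguf \
  -fa -ctk q8_0 -ctv q8_0 \
  -c 98304 -b 4096 -ub 4096 \
  -fmoe -amb 512 \
  --override-tensor exps=CPU \
  -ngl 99 --threads 101 \
  -op 27,0,29,0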

It's slower than Kimi K2, but the results seem better.

[Screenshot: 2025-07-24_17-35.png]

I'm coding locally with this model in RooCode; it absolutely behaves like a thinking model.

Hey guys, why isn't the q2ks quant doing well while browsing? It just gets stuck partway through every time.
