qwen3 2507 test

Opened by shewin

W790E Sage + QYFS + 512G + RTX5090

IQ5_K:

llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors: CPU buffer size = 40915.49 MiB
llm_load_tensors: CPU buffer size = 41335.49 MiB
llm_load_tensors: CPU buffer size = 41443.49 MiB
llm_load_tensors: CPU buffer size = 41298.98 MiB
llm_load_tensors: CPU buffer size = 491.49 MiB
llm_load_tensors: CUDA0 buffer size = 6064.04 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 163840
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 15980.05 MiB
llama_new_context_with_model: KV self size = 15980.00 MiB, K (q8_0): 7990.00 MiB, V (q8_0): 7990.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 4224.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2624.05 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 190

main: n_kv_max = 163840, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|-------:|---------:|
| 4096 | 1024 |     0 | 18.884 |   216.90 | 68.618 |    14.92 |
| 4096 | 1024 |  4096 | 18.959 |   216.05 | 70.094 |    14.61 |
| 4096 | 1024 |  8192 | 19.257 |   212.70 | 75.149 |    13.63 |
| 4096 | 1024 | 12288 | 19.597 |   209.01 | 75.659 |    13.53 |
| 4096 | 1024 | 16384 | 19.850 |   206.35 | 81.122 |    12.62 |

[screenshot: 2025-07-23_14-51.png]

> W790E Sage + QYFS + 512G + RTX5090

When the model doesn't fit in VRAM, it's useful to also give the number of memory channels and the RAM speed.

I tried IQ2_KL with this release https://github.com/Thireus/ik_llama.cpp/releases/tag/main-b3953-87fd730 from 3 days ago; the output is coherent but unrelated rambling when using the -ot parameter.
2x Nvidia 16GB VRAM + 64GB DDR5; it works fine when using -ngl only.
No such issue with the -ot parameter on Unsloth's Q2_K and mainline llama.cpp (but almost no speed gain).
I'll try more later.

Thanks all, it sounds like some folks are having luck with the first quant uploaded. I'm adding more now in a variety of sizes with solid-looking perplexity (comparisons coming as I finish up, thanks!).

@JeroenAdam you can now use the tip of main from ik_llama.cpp again, which has a fix for some speed issues and, in my anecdotal experience, also sped up the first round of chat. Please update to that and provide your full command here, and we can see if there is anything to optimize for you.

Thanks!

Also check out the newer bigger coding one: https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF

That is the one I'm talking about; more quants are on the way.

I tested again with the -ot parameter: the latest release https://github.com/Thireus/ik_llama.cpp/releases/tag/main-b3964-607e01e from an hour ago, and this time I used Unsloth's 80GB Q2_K quant https://huggingface.co/unsloth/Qwen3-235B-A22B-Instruct-2507-GGUF/tree/main/Q2_K so that I can compare ik_llama and mainline llama.cpp.

The issue remains with ik_llama: the output is mostly coherent but unrelated rambling, sometimes a single character repeated.
I tried lowering the context to make sure there was around 1GB of free VRAM on each 16GB GPU; that did not help. No such issue with llama.cpp.
The first command below is llama.cpp, the second is ik_llama.

llama-cli -m .\Qwen3-235B-A22B-Instruct-2507-Q2_K-00001-of-00002.gguf -ngl 99 -fa -c 30720 -ctk q8_0 -ctv q8_0 --main-gpu 0 -ot ".ffn_(up|down)_exps.=CPU" -t 10 --temp 0.1 -sys "You are a Java Developer." -ts 0.95,1

llama-cli -m .\Qwen3-235B-A22B-Instruct-2507-Q2_K-00001-of-00002.gguf -ngl 99 -fa -c 30720 -ctk q8_0 -ctv q8_0 --main-gpu 0 -ot ".ffn_(up|down)_exps.=CPU" -t 10 --temp 0.1 -i -cnv -p "You are a Java Developer." -ts 0.95,1

Reminder: without the -ot parameter, doing standard offloading with -ngl, there is no such issue.

@JeroenAdam

Interesting that you get different behavior depending on how you're offloading, which shouldn't be the case (beyond just speed). Thanks for the report; if I hear anyone else reporting the same, I'll try to connect you. I've not noticed it myself, but I'm not using your exact -ot commands.
Keep in mind Qwen3 has some more ffn tensors besides just the exps if you look at the differences between the tensor names in DeepSeek and Qwen3. That's why I was using an ffn.* style pattern in the model card, to try to keep all of a layer's tensors together on either CPU or GPU.
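For illustration only, that per-layer style of override might look something like the sketch below. The layer ranges are hypothetical and would need tuning to your VRAM, and the exact patterns recommended in the model card may differ; the idea is just that the catch-all ffn.*=CPU pattern only applies to layers not already claimed by the GPU patterns before it (assuming overrides are matched in the order given).

REM Hypothetical split: keep each layer's ffn tensors together, a few early layers on each GPU, the rest on CPU.
llama-cli -m .\Qwen3-235B-A22B-Instruct-2507-Q2_K-00001-of-00002.gguf -ngl 99 ^
  -ot "blk\.[0-4]\.ffn.*=CUDA0" ^
  -ot "blk\.[5-9]\.ffn.*=CUDA1" ^
  -ot "ffn.*=CPU" ^
  -fa -c 4096 -ctk q8_0 -ctv q8_0 -t 10 --temp 0.1 -i -cnv -p "You are a Java Developer."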

What is your CPU, RAM, and GPU configuration, and how are you compiling? It almost sounds like some kind of numerical instability if you're getting a single character repeated, especially if that character is D repeated (DDDDDDD), which indicates inf/NaN values. Just speculating, hopefully we can figure it out!
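(Aside: if you are building ik_llama.cpp from source rather than using prebuilt binaries, a CUDA build generally follows the usual llama.cpp CMake flow. The commands below are a sketch under that assumption, not verified against this exact release.)

REM Assumed llama.cpp-style CMake flow for a CUDA build of ik_llama.cpp.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j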

2x 16GB Nvidia (RTX 5060 Ti & P5000) + 64 GB DDR5 and an Intel i5-13400F.
I downloaded this release: https://github.com/Thireus/ik_llama.cpp/releases/download/main-b3964-607e01e/ik_llama-main-b3964-607e01e-bin-win-cuda-12.8-x64-avx2.zip
I will paste below the console output of both ik_llama.cpp and llama.cpp; maybe you'll find something.
In my latest test with ik_llama the model did pick up that my question was related to Java and started producing coherent output, but it was totally unrelated to my question.
To exclude VRAM issues I set the context to 4K, although 30K~32K should work, as it does on my machine with mainline llama.cpp.

C:\Users\Administrator\Downloads\ik_llama>llama-cli -m .\Qwen3-235B-A22B-Instruct-2507-Q2_K-00001-of-00002.gguf -ngl 99 -ot ".ffn_(up|down)_exps.=CPU" -fa -c 4096 -ctk q8_0 -ctv q8_0 --main-gpu 0 -t 10 --temp 0.1 -i -cnv -p "You are a Java Developer. Assure the use of a service class and skip equals, hashCode and toString in the entity classes. Do NOT output any .java or .properties or .xml file when nothing changed. Do NOT include the step 'test the application', 'build the application' and 'run the application'. Skip mentioning the project structure, curl or Postman." -ts 0.9,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
  Device 1: Quadro P5000, compute capability 6.1, VMM: yes
Log start
main: build = 1 (607e01e)
main: built with MSVC 19.44.35213.0 for
main: seed  = 1753394136
llama_model_loader: max stdio successfully set to 2048
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from .\Qwen3-235B-A22B-Instruct-2507-Q2_K-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv   3:                            general.version str              = 2507
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv   6:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   7:                         general.size_label str              = 235B-A22B
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv  10:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  11:                   general.base_model.count u32              = 1
llama_model_loader: - kv  12:                  general.base_model.0.name str              = Qwen3 235B A22B Instruct 2507
llama_model_loader: - kv  13:               general.base_model.0.version str              = 2507
llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  17:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv  18:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  19:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  20:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  21:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  22:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  23:                    qwen3moe.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  24:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  25:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  26:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  27:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  28:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  29:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  30:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  31:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  33:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  34:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  35:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  36:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  37:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  38:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                          general.file_type u32              = 10
llama_model_loader: - kv  41:                      quantize.imatrix.file str              = Qwen3-235B-A22B-Instruct-2507-GGUF/im...
llama_model_loader: - kv  42:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-235B-A22B-I...
llama_model_loader: - kv  43:             quantize.imatrix.entries_count u32              = 745
llama_model_loader: - kv  44:              quantize.imatrix.chunks_count u32              = 693
llama_model_loader: - kv  45:                                   split.no u16              = 0
llama_model_loader: - kv  46:                        split.tensors.count i32              = 1131
llama_model_loader: - kv  47:                                split.count u16              = 2
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q2_K:  377 tensors
llama_model_loader: - type q3_K:  188 tensors
llama_model_loader: - type q4_K:   94 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 262144
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 94
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 128
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 262144
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 235.094 B
llm_load_print_meta: model size       = 79.800 GiB (2.916 BPW)
llm_load_print_meta: repeating layers = 79.135 GiB (2.907 BPW, 233.849 B parameters)
llm_load_print_meta: general.name     = Qwen3-235B-A22B-Instruct-2507
llm_load_print_meta: BOS token        = 11 ','
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151654 '<|vision_pad|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 1536
llm_load_tensors: ggml ctx size =    1.49 MiB
Tensor blk.0.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.0.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.1.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.1.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.2.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.2.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors:        CPU buffer size = 46881.43 MiB
llm_load_tensors:        CPU buffer size = 33872.48 MiB
llm_load_tensors:        CPU buffer size =   194.74 MiB
llm_load_tensors:      CUDA0 buffer size = 12602.86 MiB
llm_load_tensors:      CUDA1 buffer size = 14209.98 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 0
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   191.27 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   208.27 MiB
llama_new_context_with_model: KV self size  =  399.50 MiB, K (q8_0):  199.75 MiB, V (q8_0):  199.75 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =   590.25 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   312.75 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 3860
llama_new_context_with_model: graph splits = 240
main: chat template example: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 10 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.100
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
        xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000
sampling order:
CFG -> Penalties -> dry -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> xtc -> top_n_sigma -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system
You are a Java Developer. Assure the use of a service class and skip equals, hashCode and toString in the entity classes. Do NOT output any .java or .properties or .xml file when nothing changed. Do NOT include the step 'test the application', 'build the application' and 'run the application'. Skip mentioning the project structure, curl or Postman.

>
C:\Users\Administrator\Downloads\llama>llama-cli -m .\Qwen3-235B-A22B-Instruct-2507-Q2_K-00001-of-00002.gguf -ngl 99 -fa -c 30720 -ctk q8_0 -ctv q8_0 --main-gpu 0 -ot ".ffn_(up|down)_exps.=CPU" -t 10 --temp 0.1 -sys "You are a Java Developer. Assure the use of a service class and skip equals, hashCode and toString in the entity classes. Do NOT output any .java or .properties or .xml file when nothing changed. Do NOT include the step 'test the application', 'build the application' and 'run the application'. Skip mentioning the project structure, curl or Postman." -ts 0.95,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
  Device 1: Quadro P5000, compute capability 6.1, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Administrator\Downloads\llama\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Users\Administrator\Downloads\llama\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\Administrator\Downloads\llama\ggml-cpu-alderlake.dll
build: 5985 (3f4fc97f) with clang version 19.1.5 for x86_64-pc-windows-msvc
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5060 Ti) - 15072 MiB free
llama_model_load_from_file_impl: using device CUDA1 (Quadro P5000) - 15363 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from .\Qwen3-235B-A22B-Instruct-2507-Q2_K-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv   3:                            general.version str              = 2507
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Qwen3-235B-A22B-Instruct-2507
llama_model_loader: - kv   6:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   7:                         general.size_label str              = 235B-A22B
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv  10:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  11:                   general.base_model.count u32              = 1
llama_model_loader: - kv  12:                  general.base_model.0.name str              = Qwen3 235B A22B Instruct 2507
llama_model_loader: - kv  13:               general.base_model.0.version str              = 2507
llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  17:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv  18:                    qwen3moe.context_length u32              = 262144
llama_model_loader: - kv  19:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  20:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  21:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  22:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  23:                    qwen3moe.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  24:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  25:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  26:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  27:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  28:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  29:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  30:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  31:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  33:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  34:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  35:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  36:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  37:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  38:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                          general.file_type u32              = 10
llama_model_loader: - kv  41:                      quantize.imatrix.file str              = Qwen3-235B-A22B-Instruct-2507-GGUF/im...
llama_model_loader: - kv  42:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-235B-A22B-I...
llama_model_loader: - kv  43:             quantize.imatrix.entries_count u32              = 745
llama_model_loader: - kv  44:              quantize.imatrix.chunks_count u32              = 693
llama_model_loader: - kv  45:                                   split.no u16              = 0
llama_model_loader: - kv  46:                        split.tensors.count i32              = 1131
llama_model_loader: - kv  47:                                split.count u16              = 2
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q2_K:  377 tensors
llama_model_loader: - type q3_K:  188 tensors
llama_model_loader: - type q4_K:   94 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q2_K - Medium
print_info: file size   = 79.80 GiB (2.92 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 262144
print_info: n_embd           = 4096
print_info: n_layer          = 94
print_info: n_head           = 64
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 16
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 262144
print_info: rope_finetuned   = unknown
print_info: model type       = 235B.A22B
print_info: model params     = 235.09 B
print_info: general.name     = Qwen3-235B-A22B-Instruct-2507
print_info: n_ff_exp         = 1536
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 94 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 95/95 layers to GPU
load_tensors:        CUDA0 model buffer size = 13162.98 MiB
load_tensors:        CUDA1 model buffer size = 13649.85 MiB
load_tensors:   CPU_Mapped model buffer size = 47102.22 MiB
load_tensors:   CPU_Mapped model buffer size = 33872.48 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 30720
llama_context: n_ctx_per_seq = 30720
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: kv_unified    = true
llama_context: freq_base     = 5000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (30720) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =  1498.12 MiB
llama_kv_cache_unified:      CUDA1 KV buffer size =  1498.12 MiB
llama_kv_cache_unified: size = 2996.25 MiB ( 30720 cells,  94 layers,  1/ 1 seqs), K (q8_0): 1498.12 MiB, V (q8_0): 1498.12 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:      CUDA0 compute buffer size =   590.25 MiB
llama_context:      CUDA1 compute buffer size =   304.75 MiB
llama_context:  CUDA_Host compute buffer size =    68.01 MiB
llama_context: graph nodes  = 6023
llama_context: graph splits = 238 (with bs=512), 190 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 30720
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 10
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 10 (n_threads_batch = 10) / 16 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: interactive mode on.
sampler seed: 2118757757
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 30720
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.100
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 30720, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

system
You are a Java Developer. Assure the use of a service class and skip equals, hashCode and toString in the entity classes. Do NOT output any .java or .properties or .xml file when nothing changed. Do NOT include the step 'test the application', 'build the application' and 'run the application'. Skip mentioning the project structure, curl or Postman.

>

@JeroenAdam

I see you are using Windows with multi-GPU, which some people have suggested might not utilize the resources as well as Linux. But specific to your case, one of your GPUs is older: "Device 1: Quadro P5000, compute capability 6.1, VMM: yes".

So my initial thoughts:

  1. Try adding -fmoe to the ik_llama.cpp command (I believe that was the first output you pasted). Not sure your older GPU supports it, though.
  2. Try using CUDA_VISIBLE_DEVICES=0 llama-cli ... to remove your older Quadro P5000 from the mix, just in case (a combined sketch follows this list).
  3. Otherwise @Thireus may have more experience on the Windows side and may have seen a similar issue before.
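For illustration, applying suggestions 1 and 2 to the earlier ik_llama command might look like the sketch below. Note that on Windows cmd the environment variable has to be set separately rather than prefixed, and with only one 16 GB card more tensors may need to stay on the CPU (or -ngl lowered) for the 235B model to fit; whether -fmoe helps on this hardware is untested.

REM Hide the Quadro P5000 so only the RTX 5060 Ti (device 0) is used, then add -fmoe.
set CUDA_VISIBLE_DEVICES=0
llama-cli -m .\Qwen3-235B-A22B-Instruct-2507-Q2_K-00001-of-00002.gguf -ngl 99 -ot ".ffn_(up|down)_exps.=CPU" -fmoe -fa -c 4096 -ctk q8_0 -ctv q8_0 --main-gpu 0 -t 10 --temp 0.1 -i -cnv -p "You are a Java Developer."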

@JeroenAdam as a first troubleshooting step I'd suggest not offloading anything to the Quadro P5000 and seeing if it still spits nonsense, as @ubergarm mentioned. It could be that the compiled version doesn't fully support compute capability 6.1. But I've never seen this behaviour, so I'm not sure I'll be able to help.

FYI, I removed -fa -ctk -ctv and now it works fine, but at 6 t/s compared to 6.6 t/s with mainline llama.cpp.
Without the P5000 I had insufficient memory to test the Qwen3 235B MoE, so I tried the Qwen3 30B MoE and that went well, with -fa and without, with the P5000 and without.
-fmoe made no difference.
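For reference, the working invocation implied above would simply be the earlier ik_llama command with the flash-attention and KV-quantization flags dropped (a sketch; the context size used for the working run isn't stated, so 4096 is carried over from the pasted log):

REM Same command as in the ik_llama log above, minus -fa, -ctk q8_0 and -ctv q8_0.
llama-cli -m .\Qwen3-235B-A22B-Instruct-2507-Q2_K-00001-of-00002.gguf -ngl 99 -ot ".ffn_(up|down)_exps.=CPU" -c 4096 --main-gpu 0 -t 10 --temp 0.1 -i -cnv -p "You are a Java Developer." -ts 0.9,1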
