testing

#2, opened by shewin

W790E Sage + QYFS + 512G + RTX5090

IQ5_K:

Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 457968.00 MiB
llm_load_tensors: CUDA_Host buffer size = 731.86 MiB
llm_load_tensors: CUDA0 buffer size = 17536.89 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 80128
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 2852.79 MiB
llama_new_context_with_model: KV self size = 2852.76 MiB, c^KV (q8_0): 2852.76 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 8777.25 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1363.88 MiB
llama_new_context_with_model: graph nodes = 24349
llama_new_context_with_model: graph splits = 118

main: n_kv_max = 80128, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV | T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 4090 | 1022 |     0 | 50.519 |    80.96 | 122.953 |     8.31 |
| 4090 | 1022 |  4090 | 50.857 |    80.42 | 156.163 |     6.54 |
| 4090 | 1022 |  8180 | 51.420 |    79.54 | 163.298 |     6.26 |
| 4090 | 1022 | 12270 | 52.924 |    77.28 | 164.144 |     6.23 |
| 4090 | 1022 | 16360 | 54.806 |    74.63 | 165.998 |     6.16 |

How can I select thinking or non-thinking?

[screenshot: 2025-08-23_22-56.png]
It took a long time.

@shewin

Thanks! I always appreciate your testing and benchmark reports!

> How can I select thinking or non-thinking?

The official documentation explains how to enable/disable thinking by adjusting your chat template here: https://huggingface.co/deepseek-ai/DeepSeek-V3.1#non-thinking

Looking at it closely, you can't just toss a `</think>` or `<think>` into your prompt; it has to be injected after the assistant block in your template. So if you're using llama-server's `/chat/completions` endpoint like I typically do, it may not be possible to select the mode.

If you are using the `/completion` (text completion) API endpoint and your client is doing the chat templating, e.g. SillyTavern etc., then you have full control and can apply the mode correctly there.

Unfortunately, it's a little tricky unless I'm missing something. Might need to add some new feature to the built-in chat templates to support this with `/chat/completions` hrmm...
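Something like this is what I mean by doing the templating client-side (an untested sketch: the special tokens are copied from the DeepSeek-V3.1 model card linked above, and the `/completion` request fields are the usual llama-server ones; double-check both against your build and your GGUF's chat template before relying on it):

```python
# Sketch: build the DeepSeek-V3.1 prompt by hand and hit llama-server's text
# completion endpoint, so the thinking / non-thinking prefix can be chosen
# per request. Assumes the server listens on localhost:8080 and the special
# tokens below match your GGUF's tokenizer -- verify both.
import requests

BOS = "<｜begin▁of▁sentence｜>"
USER = "<｜User｜>"
ASSISTANT = "<｜Assistant｜>"

def ask(query: str, system: str = "", thinking: bool = False, n_predict: int = 512) -> str:
    # Per the model card: thinking mode ends the prefix with "<think>",
    # non-thinking with "</think>".
    mode_tag = "<think>" if thinking else "</think>"
    prompt = f"{BOS}{system}{USER}{query}{ASSISTANT}{mode_tag}"
    resp = requests.post(
        "http://localhost:8080/completion",
        json={"prompt": prompt, "n_predict": n_predict, "cache_prompt": True},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["content"]

if __name__ == "__main__":
    print(ask("Write a haiku about MoE offloading.", thinking=False))
```

Clients like SillyTavern that already do their own templating over the text endpoint can apply the same trick by editing the assistant prefix in their template.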

[screenshot: 2025-08-23_15-25.png]
Without `-ctv q8_0`, I can continue beyond the context size.

IQ4_KSS:

Computed blk.60.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 101120
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 3600.16 MiB
llama_new_context_with_model: KV self size = 3600.13 MiB, c^KV (q8_0): 3600.13 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11201.75 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1691.88 MiB
llama_new_context_with_model: graph nodes = 24349
llama_new_context_with_model: graph splits = 118

main: n_kv_max = 101120, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV | T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 4090 | 1022 |     0 | 36.118 |   113.24 |  88.844 |    11.50 |
| 4090 | 1022 |  4090 | 36.784 |   111.19 | 103.610 |     9.86 |
| 4090 | 1022 |  8180 | 37.915 |   107.87 | 123.166 |     8.30 |
| 4090 | 1022 | 12270 | 38.887 |   105.18 |  93.315 |    10.95 |
| 4090 | 1022 | 16360 | 48.055 |    85.11 |  94.496 |    10.82 |

How do you find the output quality of DeepSeek V3.1? I noticed that thinking mode is much better for vibe-coding. It's truly incredible what this model can do; I was able to get DeepSeek V3.1 to produce over 3k lines while continuously making meaningful forward progress.

If you ask it to analyze and design in detail first, it will enter an architect mode. Another good method is to ask it to create test scenarios and then generate logs for analysis.
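For example, that two-step flow can be scripted against llama-server's OpenAI-compatible `/v1/chat/completions` endpoint like this (a rough sketch; the URL, model name, and prompts are just illustrative placeholders):

```python
# Rough sketch of the "analyze and design first, then implement" workflow
# described above, using llama-server's OpenAI-compatible chat endpoint.
import requests

URL = "http://localhost:8080/v1/chat/completions"

def chat(messages, max_tokens=2048):
    r = requests.post(URL, json={"model": "deepseek-v3.1",
                                 "messages": messages,
                                 "max_tokens": max_tokens}, timeout=1200)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

task = "Build a small CLI tool that tails a log file and flags error spikes."

# Phase 1: ask for analysis and a detailed design (modules, data flow,
# test scenarios) before any code is written.
design = chat([{"role": "user",
                "content": "Analyze this task and produce a detailed design "
                           "(modules, data flow, test scenarios) before "
                           f"writing any code:\n{task}"}])

# Phase 2: feed the design back and ask for the implementation, plus logging
# so the test scenarios can later be checked from the logs.
code = chat([{"role": "user", "content": f"Task: {task}"},
             {"role": "assistant", "content": design},
             {"role": "user",
              "content": "Now implement it, and add logging so the test "
                         "scenarios above can be verified from the logs."}])

print(code)
```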

> How do you find the output quality of DeepSeek V3.1? I noticed that thinking mode is much better for vibe-coding. It's truly incredible what this model can do; I was able to get DeepSeek V3.1 to produce over 3k lines while continuously making meaningful forward progress.

So happy to hear that!
I've finally managed to rebuild the server and I cannot wait to try this one out.
