Testing on W790E Sage + QYFS + 512 GB RAM + RTX 5090.
IQ5_K:
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 457968.00 MiB
llm_load_tensors: CUDA_Host buffer size = 731.86 MiB
llm_load_tensors: CUDA0 buffer size = 17536.89 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 80128
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 2852.79 MiB
llama_new_context_with_model: KV self size = 2852.76 MiB, c^KV (q8_0): 2852.76 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 8777.25 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1363.88 MiB
llama_new_context_with_model: graph nodes = 24349
llama_new_context_with_model: graph splits = 118
main: n_kv_max = 80128, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 4090 | 1022 | 0 | 50.519 | 80.96 | 122.953 | 8.31 |
| 4090 | 1022 | 4090 | 50.857 | 80.42 | 156.163 | 6.54 |
| 4090 | 1022 | 8180 | 51.420 | 79.54 | 163.298 | 6.26 |
| 4090 | 1022 | 12270 | 52.924 | 77.28 | 164.144 | 6.23 |
| 4090 | 1022 | 16360 | 54.806 | 74.63 | 165.998 | 6.16 |
How can I select thinking or non-thinking?
Thanks! I always appreciate your testing and benchmark reports!
> How can I select thinking or non-thinking?
The official documentation explains how to enable/disable thinking by adjusting your chat template here: https://huggingface.co/deepseek-ai/DeepSeek-V3.1#non-thinking
Looking at it closely, you can't just toss a `</think>` or `<think>` into your prompt; it has to be injected right after the assistant tag in the chat template. So if you're using llama-server's `/chat/completions` endpoint like I typically do, you may not be able to select the mode.
If you are using the text completion endpoint (`/completion`) and your client is applying the chat template itself (e.g. SillyTavern), then you have full control and can apply the mode correctly there.
Unfortunately, it's a little tricky unless I'm missing something. Might need to add some new feature to the built-in chat templates to support this with `/chat/completions`.
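For what it's worth, here is a minimal sketch of the text-completion route, assuming llama-server is at `localhost:8080` and using the prompt format from the model card linked above; the exact special tokens should be double-checked against the chat template embedded in your GGUF. The client builds the prompt itself and picks the mode by appending `<think>` or `</think>` after the assistant tag.

```python
# Minimal sketch (not a tested recipe): select DeepSeek-V3.1 thinking vs
# non-thinking mode when the client builds the prompt itself and calls
# llama-server's raw /completion endpoint. Special tokens follow the
# DeepSeek-V3.1 model card; verify them against your GGUF's chat template.
import requests

SERVER = "http://localhost:8080"  # assumed llama-server address


def build_prompt(user_msg: str, thinking: bool, system: str = "") -> str:
    # Thinking mode opens a <think> block right after the assistant tag;
    # non-thinking mode closes it immediately with </think>.
    assistant_prefix = "<think>" if thinking else "</think>"
    return (
        "<｜begin▁of▁sentence｜>" + system
        + "<｜User｜>" + user_msg
        + "<｜Assistant｜>" + assistant_prefix
    )


def complete(user_msg: str, thinking: bool) -> str:
    # Non-streaming request against llama-server's /completion endpoint.
    resp = requests.post(
        f"{SERVER}/completion",
        json={"prompt": build_prompt(user_msg, thinking), "n_predict": 512},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["content"]


if __name__ == "__main__":
    print(complete("Write a haiku about GPUs.", thinking=False))
```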
hrmm...
IQ4_KSS:
Computed blk.60.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 101120
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 3600.16 MiB
llama_new_context_with_model: KV self size = 3600.13 MiB, c^KV (q8_0): 3600.13 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11201.75 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1691.88 MiB
llama_new_context_with_model: graph nodes = 24349
llama_new_context_with_model: graph splits = 118
main: n_kv_max = 101120, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 4090 | 1022 | 0 | 36.118 | 113.24 | 88.844 | 11.50 |
| 4090 | 1022 | 4090 | 36.784 | 111.19 | 103.610 | 9.86 |
| 4090 | 1022 | 8180 | 37.915 | 107.87 | 123.166 | 8.30 |
| 4090 | 1022 | 12270 | 38.887 | 105.18 | 93.315 | 10.95 |
| 4090 | 1022 | 16360 | 48.055 | 85.11 | 94.496 | 10.82 |
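As a quick sanity check on the two sweeps, a throwaway sketch like the one below (numbers copied by hand from the tables above) gives the IQ4_KSS over IQ5_K speedup at each N_KV point: roughly 1.1-1.4x on prompt processing and 1.3-1.8x on token generation for these runs.

```python
# Rough comparison of the IQ5_K vs IQ4_KSS sweep results above.
# Values are (S_PP t/s, S_TG t/s) copied from the tables; adjust if you re-run.
iq5_k   = {0: (80.96, 8.31), 4090: (80.42, 6.54), 8180: (79.54, 6.26),
           12270: (77.28, 6.23), 16360: (74.63, 6.16)}
iq4_kss = {0: (113.24, 11.50), 4090: (111.19, 9.86), 8180: (107.87, 8.30),
           12270: (105.18, 10.95), 16360: (85.11, 10.82)}

print(f"{'N_KV':>6} {'PP speedup':>11} {'TG speedup':>11}")
for n_kv in sorted(iq5_k):
    pp5, tg5 = iq5_k[n_kv]
    pp4, tg4 = iq4_kss[n_kv]
    print(f"{n_kv:>6} {pp4 / pp5:>10.2f}x {tg4 / tg5:>10.2f}x")
```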
How do you find the output quality of DeepSeek V3.1? I noticed that thinking mode is much better for vibe-coding. It's truly incredible what this model can do. I was able to make DeepSeek V3.1 produce over 3k lines while continuously making meaningful forward progress.
If you ask it to analyze and design in detail first, it will enter an architect mode. Another good approach is to ask it to create test scenarios and then generate logs for analysis.
> How do you find the output quality of DeepSeek V3.1? I noticed that thinking mode is much better for vibe-coding. It's truly incredible what this model can do. I was able to make DeepSeek V3.1 produce over 3k lines while continuously making meaningful forward progress.
So happy to hear that!
I've finally managed to rebuild the server and I cannot wait to try this one out.