Testing on W790E Sage + QYFS + 512 GB RAM + RTX 5090.
IQ5_K:
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 457968.00 MiB
llm_load_tensors: CUDA_Host buffer size = 731.86 MiB
llm_load_tensors: CUDA0 buffer size = 17536.89 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 80128
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 2852.79 MiB
llama_new_context_with_model: KV self size = 2852.76 MiB, c^KV (q8_0): 2852.76 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 8777.25 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1363.88 MiB
llama_new_context_with_model: graph nodes = 24349
llama_new_context_with_model: graph splits = 118
main: n_kv_max = 80128, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 4090 | 1022 | 0 | 50.519 | 80.96 | 122.953 | 8.31 |
| 4090 | 1022 | 4090 | 50.857 | 80.42 | 156.163 | 6.54 |
| 4090 | 1022 | 8180 | 51.420 | 79.54 | 163.298 | 6.26 |
| 4090 | 1022 | 12270 | 52.924 | 77.28 | 164.144 | 6.23 |
| 4090 | 1022 | 16360 | 54.806 | 74.63 | 165.998 | 6.16 |
How can I select thinking or non-thinking?
Thanks! I always appreciate your testing and benchmark reports!
> How can I select thinking or non-thinking?
The official documentation explains how to enable/disable thinking by adjusting your chat template here: https://huggingface.co/deepseek-ai/DeepSeek-V3.1#non-thinking
Looking at it closely, you can't just toss a `</think>` or `<think>` into your prompt; it has to be injected right after the assistant tag in the chat template. So if you're using llama-server's `/chat/completions` endpoint like I typically do, you may not be able to select the mode.
If you are using the text completion endpoint (`/completion`) and your client is applying the chat template itself (e.g. SillyTavern), then you have full control and can apply the mode correctly there.
Unfortunately, it's a little tricky unless I'm missing something. Might need to add some new feature to the built-in chat templates to support this with `/chat/completions`.
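For what it's worth, here is a minimal sketch of the text-completion route, assuming llama-server is at `localhost:8080` and using the prompt format from the model card linked above; the exact special tokens should be double-checked against the chat template embedded in your GGUF. The client builds the prompt itself and picks the mode by appending `<think>` or `</think>` after the assistant tag.

```python
# Minimal sketch (not a tested recipe): select DeepSeek-V3.1 thinking vs
# non-thinking mode when the client builds the prompt itself and calls
# llama-server's raw /completion endpoint. Special tokens follow the
# DeepSeek-V3.1 model card; verify them against your GGUF's chat template.
import requests

SERVER = "http://localhost:8080"  # assumed llama-server address


def build_prompt(user_msg: str, thinking: bool, system: str = "") -> str:
    # Thinking mode opens a <think> block right after the assistant tag;
    # non-thinking mode closes it immediately with </think>.
    assistant_prefix = "<think>" if thinking else "</think>"
    return (
        "<｜begin▁of▁sentence｜>" + system
        + "<｜User｜>" + user_msg
        + "<｜Assistant｜>" + assistant_prefix
    )


def complete(user_msg: str, thinking: bool) -> str:
    # Non-streaming request against llama-server's /completion endpoint.
    resp = requests.post(
        f"{SERVER}/completion",
        json={"prompt": build_prompt(user_msg, thinking), "n_predict": 512},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["content"]


if __name__ == "__main__":
    print(complete("Write a haiku about GPUs.", thinking=False))
```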
hrmm...
IQ4_KSS:
Computed blk.60.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 101120
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 3600.16 MiB
llama_new_context_with_model: KV self size = 3600.13 MiB, c^KV (q8_0): 3600.13 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11201.75 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1691.88 MiB
llama_new_context_with_model: graph nodes = 24349
llama_new_context_with_model: graph splits = 118
main: n_kv_max = 101120, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 4090 | 1022 | 0 | 36.118 | 113.24 | 88.844 | 11.50 |
| 4090 | 1022 | 4090 | 36.784 | 111.19 | 103.610 | 9.86 |
| 4090 | 1022 | 8180 | 37.915 | 107.87 | 123.166 | 8.30 |
| 4090 | 1022 | 12270 | 38.887 | 105.18 | 93.315 | 10.95 |
| 4090 | 1022 | 16360 | 48.055 | 85.11 | 94.496 | 10.82 |
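As a quick sanity check on the two sweeps, a throwaway sketch like the one below (numbers copied by hand from the tables above) gives the IQ4_KSS over IQ5_K speedup at each N_KV point: roughly 1.1-1.4x on prompt processing and 1.3-1.8x on token generation for these runs.

```python
# Rough comparison of the IQ5_K vs IQ4_KSS sweep results above.
# Values are (S_PP t/s, S_TG t/s) copied from the tables; adjust if you re-run.
iq5_k   = {0: (80.96, 8.31), 4090: (80.42, 6.54), 8180: (79.54, 6.26),
           12270: (77.28, 6.23), 16360: (74.63, 6.16)}
iq4_kss = {0: (113.24, 11.50), 4090: (111.19, 9.86), 8180: (107.87, 8.30),
           12270: (105.18, 10.95), 16360: (85.11, 10.82)}

print(f"{'N_KV':>6} {'PP speedup':>11} {'TG speedup':>11}")
for n_kv in sorted(iq5_k):
    pp5, tg5 = iq5_k[n_kv]
    pp4, tg4 = iq4_kss[n_kv]
    print(f"{n_kv:>6} {pp4 / pp5:>10.2f}x {tg4 / tg5:>10.2f}x")
```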
How do you find the output quality of DeepSeek V3.1? I noticed that thinking mode is much better for vibe-coding. It's truly incredible what this model can do. I was able to make DeepSeek V3.1 produce over 3k lines while continuously making meaningful forward progress.
If you ask it to analyze and design in detail first, it will enter an architect mode. Another good approach is to ask it to create test scenarios and then generate logs for analysis.
> How do you find the output quality of DeepSeek V3.1? I noticed that thinking mode is much better for vibe-coding. It's truly incredible what this model can do. I was able to make DeepSeek V3.1 produce over 3k lines while continuously making meaningful forward progress.
So happy to hear that!
I've finally managed to rebuild the server and I cannot wait to try this one out.