Kimi K2 v0.2 test
W790E Sage + QYFS + 512 GB RAM + RTX 5090
with -op 27,0,29,0
Thanks Kebob
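The -op pairs are (op index, on/off): per the "Setting offload policy" lines further down, op 27 is MUL_MAT_ID and op 29 is MOE_FUSED_UP_GATE, so both stay on the CPU instead of being offloaded during batched prompt processing. A rough sketch of the sweep-bench invocation these logs imply (the model path is a placeholder, the binary name and exact flag spellings can vary between ik_llama.cpp builds, and everything else is read off the log below):

```
# Hypothetical reconstruction of the run above (ik_llama.cpp's llama-sweep-bench).
# Context/batch/attention/thread values come from the log; the model path is a
# placeholder and flag spellings may differ per build.
./llama-sweep-bench \
  -m Kimi-K2-Instruct-IQ3_KS.gguf \
  -c 163840 -b 4090 -ub 4090 \
  -fa -mla 3 -amb 512 -fmoe \
  -ctk q8_0 \
  -ngl 63 -t 101 \
  -op 27,0,29,0
```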
IQ3_KS:
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 43751.77 MiB
llm_load_tensors: CPU buffer size = 43751.77 MiB
llm_load_tensors: CPU buffer size = 43751.77 MiB
llm_load_tensors: CPU buffer size = 43751.77 MiB
llm_load_tensors: CPU buffer size = 43751.77 MiB
llm_load_tensors: CPU buffer size = 43751.77 MiB
llm_load_tensors: CPU buffer size = 43751.77 MiB
llm_load_tensors: CPU buffer size = 43751.77 MiB
llm_load_tensors: CPU buffer size = 43751.77 MiB
llm_load_tensors: CPU buffer size = 44679.30 MiB
llm_load_tensors: CPU buffer size = 630.00 MiB
llm_load_tensors: CUDA0 buffer size = 11409.44 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 163840
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 50000.0
llama_new_context_with_model: freq_scale = 0.03125
llama_kv_cache_init: CUDA0 KV buffer size = 5833.15 MiB
llama_new_context_with_model: KV self size = 5833.12 MiB, c^KV (q8_0): 5833.12 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 10943.16 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2671.88 MiB
llama_new_context_with_model: graph nodes = 24387
llama_new_context_with_model: graph splits = 122
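Sanity check on the KV figure: with MLA the cache stores one 576-wide latent per token per layer (512 compressed-KV dims + 64 RoPE dims, the DeepSeek-V3-style values Kimi K2 appears to inherit — my assumption), so 163840 tokens × 61 layers × 576 × 1.0625 bytes (q8_0) ≈ 5833 MiB, matching the KV self size reported above.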
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
main: n_kv_max = 163840, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101
|    PP |   TG |  N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|------:|-----:|------:|---------:|---------:|---------:|---------:|
|  4090 | 1022 |     0 |   26.486 |   154.42 |   70.008 |    14.60 |
|  4090 | 1022 |  4090 |   27.707 |   147.62 |   90.426 |    11.30 |
|  4090 | 1022 |  8180 |   28.677 |   142.62 |   84.301 |    12.12 |
|  4090 | 1022 | 12270 |   28.836 |   141.84 |   92.174 |    11.09 |
|  4090 | 1022 | 16360 |   28.711 |   142.45 |   86.182 |    11.86 |
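So with the expert matmuls (ops 27/29) kept on the CPU, prompt processing holds at roughly 140–155 t/s and generation at roughly 11–15 t/s from an empty cache out to ~16k tokens of context with this IQ3_KS quant.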