metadata
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: moonshotai/Kimi-K2-Instruct-0905
license: other
license_name: modified-mit
license_link: https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/blob/main/LICENSE
base_model_relation: quantized
tags:
  - mla
  - imatrix
  - conversational
  - ik_llama.cpp

ik_llama.cpp imatrix Quantizations of moonshotai/Kimi-K2-Instruct-0905

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants.

Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCPP. For pre-built Windows binaries of ik_llama.cpp, check out Thireus' fork here.

These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Notes

  • The current imatrix dat file seems to be missing entries for just the single dense layer and shared expert so all my recipes are using q8_0 for those.
  • For notes on tool-calling API endpoints, check out the details in this PR: https://github.com/ikawrakow/ik_llama.cpp/pull/723
  • smol here simply means the routed experts recipe uses the same quantization for down as well as (gate|up) tensors.

Quant Collection

Compare with baseline perplexity of full size Q8_0 1016.117 GiB (8.504 BPW)

Final estimate: PPL = 2.4443 +/- 0.01175
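
The size and BPW figures quoted for each quant are tied together by simple arithmetic: bits per weight is roughly the file size in bits divided by the number of weights. As a rough sanity check, here is a minimal sketch that backs out the implied weight count from two of the size/BPW pairs listed in this card. The helper name `implied_params_b` is made up for illustration and is not part of ik_llama.cpp, and the result ignores the small amount of non-tensor metadata in a GGUF.

```bash
#!/usr/bin/env bash
# Sketch: BPW = (file size in bits) / (number of weights), so
# weights = GiB * 2^30 * 8 / BPW. implied_params_b is a hypothetical helper.
implied_params_b() {
  # $1: file size in GiB, $2: bits per weight; prints weights in billions
  awk -v gib="$1" -v bpw="$2" \
    'BEGIN { printf "%.0f\n", gib * 1024 * 1024 * 1024 * 8 / bpw / 1e9 }'
}

implied_params_b 1016.117 8.504   # Q8_0 baseline
implied_params_b 632.664 5.295    # smol-IQ5_KS
```

Both calls land on about the same weight count (roughly a trillion), which is what you would expect if every quant in the collection is built from the same model.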

Perplexity Chart

smol-IQ5_KS 632.664 GiB (5.295 BPW)

Final estimate: PPL = 2.4526 +/- 0.01182

👈 Secret Recipe
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_ks

## Token embedding and output tensors (GPU)
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-smol-IQ5_KS.gguf \
    IQ5_KS \
    192
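
Every recipe script flattens its commented, multi-line rule list into the single comma-separated string that is passed to `--custom-q`. The following is a minimal sketch of just that step on a shortened three-rule recipe (rules borrowed from the smol-IQ5_KS recipe above). It assumes GNU `sed` for the `-z` flag, and uses `printf` in place of `echo` only for portability across shells.

```bash
#!/usr/bin/env bash
# Shortened recipe for illustration only; the real recipes have more rules.
custom="
## Attention (GPU)
blk\..*\.attn_k_b\.weight=q8_0

## Routed Experts (CPU)
blk\..*\.ffn_down_exps\.weight=iq5_ks

## Token embedding (GPU)
token_embd\.weight=iq6_k
"

# 1. grep drops the comment lines that start with '#'.
# 2. sed -z treats the whole text as one record, turns every run of
#    newlines into a single comma, then trims the trailing and leading comma.
flat=$(
  printf '%s\n' "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

printf '%s\n' "$flat"
```

The result is one line of `regex=type` pairs separated by commas, with no comments or blank entries left over.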

smol-IQ4_KSS 485.008 GiB (4.059 BPW)

Final estimate: PPL = 2.5185 +/- 0.01221

👈 Secret Recipe
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

## Token embedding and output tensors (GPU)
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-smol-IQ4_KSS.gguf \
    IQ4_KSS \
    192

IQ4_KS 553.624 GiB (4.633 BPW)

Final estimate: PPL = 2.4641 +/- 0.01190

👈 Secret Recipe
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_ks

## Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-IQ4_KS.gguf \
    IQ4_KS \
    192

IQ3_KS 420.558 GiB (3.520 BPW)

Final estimate: PPL = 2.5640 +/- 0.01262

👈 Secret Recipe
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

## Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-IQ3_KS.gguf \
    IQ3_KS \
    192

smol-IQ3_KS 388.258 GiB (3.249 BPW)

Final estimate: PPL = 2.5902 +/- 0.01284

👈 Secret Recipe
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

## Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-smol-IQ3_KS.gguf \
    IQ3_KS \
    192

IQ2_KL 358.419 GiB (3.000 BPW)

Final estimate: PPL = 2.7993 +/- 0.01416

👈 Secret Recipe
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq3_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

## Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-IQ2_KL.gguf \
    IQ2_KL \
    192

smol-IQ2_KL 329.195 GiB (2.755 BPW)

Final estimate: PPL = 2.9294 +/- 0.01499

👈 Secret Recipe
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq2_kl
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

## Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-smol-IQ2_KL.gguf \
    IQ2_KL \
    192

IQ2_KS 289.820 GiB (2.425 BPW)

Final estimate: PPL = 3.2478 +/- 0.01721

👈 Secret Recipe
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq2_kl
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks

## Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-IQ2_KS.gguf \
    IQ2_KS \
    192

smol-IQ2_KS 270.133 GiB (2.261 BPW)

Final estimate: PPL = 3.4977 +/- 0.01924

👈 Secret Recipe
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq2_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks

## Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-smol-IQ2_KS.gguf \
    IQ2_KS \
    192

smol-IQ1_KT 218.936 GiB (1.832 BPW)

Final estimate: PPL = 4.2224 +/- 0.02443

👈 Secret Recipe
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq1_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt

## Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-smol-IQ1_KT.gguf \
    IQ1_KT \
    192
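
To help pick from the collection, here is a rough sketch that selects the lowest-perplexity quant whose file size fits a given budget and reports how far its perplexity sits above the Q8_0 baseline. The sizes and PPL values are copied from the listings above. The `pick_quant` helper is hypothetical, and the budget covers weights only, so leave extra headroom for KV cache and compute buffers.

```bash
#!/usr/bin/env bash
# name  size_GiB  PPL   (values copied from the quant collection above)
quants="smol-IQ5_KS 632.664 2.4526
IQ4_KS 553.624 2.4641
smol-IQ4_KSS 485.008 2.5185
IQ3_KS 420.558 2.5640
smol-IQ3_KS 388.258 2.5902
IQ2_KL 358.419 2.7993
smol-IQ2_KL 329.195 2.9294
IQ2_KS 289.820 3.2478
smol-IQ2_KS 270.133 3.4977
smol-IQ1_KT 218.936 4.2224"

# pick_quant is a hypothetical helper, not an ik_llama.cpp tool.
# $1: budget in GiB available for model weights (RAM + VRAM combined).
# Prints "<name> <PPL increase over Q8_0>%", or nothing if no quant fits.
pick_quant() {
  printf '%s\n' "$quants" | awk -v budget="$1" -v base=2.4443 '
    ($2 + 0) <= (budget + 0) && (best == "" || ($3 + 0) < ppl) {
      best = $1; ppl = $3 + 0
    }
    END {
      if (best != "") printf "%s %.1f%%\n", best, 100 * (ppl - base) / base
    }'
}

pick_quant 512
pick_quant 256
```

Perplexity is only one proxy for quality, so treat the pick as a starting point rather than a verdict.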

Example Commands

Hybrid (multiple) CUDA + CPU

# Two CUDA devices with enough VRAM to offload more layers
# Keep in mind Kimi-K2 has a single dense layer, so its expert layers start at 1, unlike DeepSeek where they start at 3
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/Kimi-K2-Instruct-0905 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -fa -fmoe \
    -mla 3 \
    -ngl 99 \
    -ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \
    -ot "blk\.(4|5|6)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    --parallel 1 \
    --threads 48 \
    --threads-batch 64 \
    --host 127.0.0.1 \
    --port 8080
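
Before loading hundreds of GiB, it can help to dry-run the `-ot` patterns against a few tensor names. The sketch below checks the three override rules from the command above with `grep -E`. It assumes the rules are tried in the order given and that the first matching one wins, which is my understanding of the behavior but worth confirming against your build. The `placement` helper and the sample tensor names are illustrative.

```bash
#!/usr/bin/env bash
# The three -ot rules from the hybrid command, in command-line order.
rules='blk\.(1|2|3)\.ffn_.*=CUDA0
blk\.(4|5|6)\.ffn_.*=CUDA1
exps=CPU'

# placement is a hypothetical helper: prints the device of the first rule
# whose regex matches the tensor name, or "default" when none match
# (such tensors would simply follow -ngl).
placement() {
  local name=$1 rule
  while IFS= read -r rule; do
    # "${rule%=*}" is the regex, "${rule##*=}" is the device after the last '='
    if printf '%s\n' "$name" | grep -Eq -- "${rule%=*}"; then
      printf '%s\n' "${rule##*=}"
      return
    fi
  done <<< "$rules"
  printf 'default\n'
}

for t in blk.1.ffn_down_exps.weight blk.5.ffn_up_exps.weight \
         blk.10.ffn_gate_exps.weight blk.10.ffn_down_shexp.weight; do
  printf '%s -> %s\n' "$t" "$(placement "$t")"
done
```

Note that `blk\.(1|2|3)\.` does not match layer 10 and up, because the digit must be followed directly by a dot, and that `exps` does not match the `shexp` shared-expert tensors.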

CPU-Only (no GPU)

# compile
cmake -B build -DGGML_CUDA=0 -DGGML_BLAS=0 -DGGML_VULKAN=0
cmake --build build --config Release -j $(nproc)

# run server
# run on a single CPU socket of a dual-socket rig configured with one NUMA node per socket
numactl -N 0 -m 0 \
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/Kimi-K2-Instruct-0905 \
    --ctx-size 98304 \
    -ctk q8_0 \
    -fa -fmoe \
    -mla 3 \
    --parallel 1 \
    --threads 128 \
    --threads-batch 192 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080

References