ik_llama.cpp imatrix Quantizations of Qwen/Qwen3-30B-A3B-Thinking-2507

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.

Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCpp.

These quants provide best in class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and the YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.
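
These numbers can be reproduced with ik_llama.cpp's llama-perplexity tool; a minimal sketch, assuming a CUDA build with full offload (the -ngl and --threads values are assumptions, adjust for your hardware):

# Sketch: compute perplexity of a quant over the wikitext-2 test split
./build/bin/llama-perplexity \
    --model Qwen3-30B-A3B-Thinking-2507-IQ5_K.gguf \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    --threads 8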

[Perplexity Chart]

These first three are just test quants for baseline perplexity comparison:

  • bf16 56.894 GiB (16.007 BPW)
    • Final estimate: PPL = 7.3149 +/- 0.05076
  • Q8_0 30.247 GiB (8.510 BPW)
    • Final estimate: PPL = 7.3284 +/- 0.05091
  • Q4_0 16.111 GiB (4.533 BPW)
    • Final estimate: PPL = 7.4534 +/- 0.05151
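
(BPW is simply the file size in bits over the parameter count: e.g. for the bf16, 56.894 GiB * 1024^3 * 8 bits / ~30.5B params comes out to about 16.0 BPW.)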

IQ5_K 21.324 GiB (5.999 BPW)

Final estimate: PPL = 7.3440 +/- 0.05091

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k

# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  # strip the comment lines, then join the rest into one comma-separated list
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ5_K.gguf \
    IQ5_K \
    192 # nthreads
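
A note on ordering: --custom-q rules are matched top to bottom with the first matching regex winning, which is why the specific layer 0 and layer 47 overrides are listed before the catch-all blk\..* patterns. After the sed transform, $custom collapses into a single comma-separated list of pattern=type pairs, roughly (abbreviated):

blk\.(0)\.attn_q.*=q8_0,blk\.(0)\.attn_k.*=q8_0,...,token_embd\.weight=iq6_k,output\.weight=iq6_k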

IQ4_K 17.878 GiB (5.030 BPW)

Final estimate: PPL = 7.3634 +/- 0.05104

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ4_K.gguf \
    IQ4_K \
    192

IQ4_KSS 15.531 GiB (4.370 BPW)

Final estimate: PPL = 7.3861 +/- 0.05128

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ4_KSS.gguf \
    IQ4_KSS \
    192

IQ4_KT 14.438 GiB (4.062 BPW)

Final estimate: PPL = 7.5020 +/- 0.05230

Mostly pure IQ4_KT, meant for full GPU offload, similar to turboderp-org/exllamav3. Check out ArtusDev's HuggingFace page for some excellent EXL3 quants!

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq4_kt
blk\..*\.attn_v.*=iq4_kt
blk\..*\.attn_output.*=iq4_kt

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq4_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kt

# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ4_KT.gguf \
    IQ4_KT \
    192

IQ3_K 14.509 GiB (4.082 BPW)

Final estimate: PPL = 7.4360 +/- 0.05162

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ3_K.gguf \
    IQ3_K \
    192

IQ3_KS 13.633 GiB (3.836 BPW)

Final estimate: PPL = 7.4959 +/- 0.05204

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq4_ks
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_ks

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ3_KS.gguf \
    IQ3_KS \
    192

IQ2_KL 11.516 GiB (3.240 BPW)

Final estimate: PPL = 7.6992 +/- 0.05345

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ2_KL.gguf \
    IQ2_KL \
    192

IQ2_KT 9.469 GiB (2.664 BPW)

Final estimate: PPL = 8.0207 +/- 0.05638

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]
blk\.(0)\.attn_q.*=iq5_ks
blk\.(0)\.attn_k.*=iq6_k
blk\.(0)\.attn_v.*=iq6_k
blk\.(0)\.attn_output.*=iq5_ks

# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_kt

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=iq4_kt
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=iq4_kt

blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt

# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ2_KT.gguf \
    IQ2_KT \
    192

IQ1_KT 7.583 GiB (2.133 BPW)

Final estimate: PPL = 8.8341 +/- 0.06231

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]
blk\.(0)\.attn_q.*=iq5_ks
blk\.(0)\.attn_k.*=iq6_k
blk\.(0)\.attn_v.*=iq6_k
blk\.(0)\.attn_output.*=iq5_ks

# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_kt

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=iq4_kt
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=iq4_kt

blk\..*\.ffn_down_exps\.weight=iq2_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt

# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ1_KT.gguf \
    IQ1_KT \
    192

Quick Start
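
If you haven't built ik_llama.cpp yet, clone it first (build commands for each backend follow):

# One-time setup: fetch the ik_llama.cpp fork
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp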

Full GPU Offload with CUDA or Vulkan (for AMD GPUs)

# Compile CUDA backend
cmake -B ./build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_F16=ON
cmake --build ./build --config Release -j $(nproc)

# Compile Vulkan backend
# Experimental: doesn't work with all quant types yet, needs more testing
# https://github.com/ikawrakow/ik_llama.cpp/discussions/590
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_HIPBLAS=0 -DGGML_VULKAN=1
cmake --build build --config Release -j $(nproc)

# Run Server
./build/bin/llama-server \
    --model Qwen3-30B-A3B-Thinking-2507-IQ3_KS.gguf \
    --alias ubergarm/Qwen3-30B-A3B-Thinking-2507 \
    --ctx-size 32768 \
    -ctk q8_0 -ctv q8_0 \
    -fa -fmoe \
    -ngl 99 \
    --parallel 1 \
    --threads 1 \
    --host 127.0.0.1 \
    --port 8080
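
Once the server is up, a quick smoke test against its OpenAI-compatible chat completions endpoint (host, port, and alias taken from the flags above):

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ubergarm/Qwen3-30B-A3B-Thinking-2507",
    "messages": [{"role": "user", "content": "Briefly, what is 7 * 6?"}],
    "max_tokens": 256
  }'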

CPU-only Backend

# Compile
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=0 -DGGML_VULKAN=0
cmake --build build --config Release -j $(nproc)

# Run Server
./build/bin/llama-server \
    --model Qwen3-30B-A3B-Thinking-2507-IQ3_KS.gguf \
    --alias ubergarm/Qwen3-30B-A3B-Thinking-2507 \
    --ctx-size 32768 \
    -ctk q8_0 -ctv q8_0 \
    -fa -fmoe \
    -ub 4096 -b 4096 \
    --parallel 1 \
    --threads 8 \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap
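
A few notes on these flags: -ub 4096 -b 4096 raise the batch sizes to speed up prompt processing on CPU at the cost of some extra RAM, --no-mmap loads the whole model into RAM up front instead of memory-mapping it, and --threads should roughly match your physical core count.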

imatrix note

I used @eaddario's eaddario-imatrix-corpus-combined-all-medium dataset, converted to text like so:

$ apt-get install duckdb
$ duckdb -ascii -c "SELECT * FROM read_parquet('combined_all_medium.parquet');" > eaddario-imatrix-corpus-combined-all-medium.txt
$ du -h eaddario-imatrix-corpus-combined-all-medium.txt
9.4M    eaddario-imatrix-corpus-combined-all-medium.txt
$ sha1sum eaddario-imatrix-corpus-combined-all-medium.txt
4cde1d5401abdc399b22ab9ede82b63684ad6bb4  eaddario-imatrix-corpus-combined-all-medium.txt
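
The resulting text file is then fed to the imatrix tool against the full bf16 model; a sketch of that step (binary name and flags per ik_llama.cpp conventions, paths shortened):

./build/bin/llama-imatrix \
    --model Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    -f eaddario-imatrix-corpus-combined-all-medium.txt \
    -o imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat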
