ik_llama.cpp imatrix Quantizations of Qwen/Qwen3-30B-A3B-Instruct-2507
This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!
NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.
Some of ik's new quants are also supported by the Nexesenex/croco.cpp fork of KoboldCpp.
These quants provide best in class perplexity for the given memory footprint.
Big Thanks
Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!
Quant Collection
Perplexity computed against wiki.test.raw.
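For reproducibility, these numbers come from the usual llama-perplexity run; a minimal sketch of the command, assuming a local copy of wiki.test.raw (adjust the model path and thread count for your rig):

# Measure perplexity over wiki.test.raw
./build/bin/llama-perplexity \
    --model Qwen3-30B-A3B-Instruct-2507-IQ5_K.gguf \
    -f wiki.test.raw \
    --threads 16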
These first two are just test quants for baseline perplexity comparison:
bf16
56.894 GiB (16.007 BPW)
Final estimate: PPL = 7.3594 +/- 0.05170
Q8_0
30.247 GiB (8.510 BPW)
Final estimate: PPL = 7.3606 +/- 0.05171
IQ5_K
21.324 GiB (5.999 BPW)
Final estimate: PPL = 7.3806 +/- 0.05170
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k
# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ5_K.gguf \
IQ5_K \
192
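To sanity-check what the grep/sed pipeline actually passes to --custom-q, you can print the collapsed rule string before quantizing; a trivial check:

# Should print one line of comma-separated tensor-regex=quant rules
echo "$custom"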
IQ4_K
17.878 GiB (5.030 BPW)
Final estimate: PPL = 7.3951 +/- 0.05178
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ4_K.gguf \
IQ4_K \
192
IQ4_KSS
15.531 GiB (4.370 BPW)
Final estimate: PPL = 7.4392 +/- 0.05225
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ4_KSS.gguf \
IQ4_KSS \
192
IQ3_K
14.509 GiB (4.082 BPW)
Final estimate: PPL = 7.4991 +/- 0.05269
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ3_K.gguf \
IQ3_K \
192
IQ3_KS
13.633 GiB (3.836 BPW)
Final estimate: PPL = 7.5512 +/- 0.05307
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq4_ks
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_ks
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ3_KS.gguf \
IQ3_KS \
192
IQ2_KL
11.516 GiB (3.240 BPW)
Final estimate: PPL = 7.7121 +/- 0.05402
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ2_KL.gguf \
IQ2_KL \
192
IQ2_KT
9.469 GiB (2.664 BPW)
Final estimate: PPL = 8.0270 +/- 0.05698
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=iq5_ks
blk\.(0)\.attn_k.*=iq6_k
blk\.(0)\.attn_v.*=iq6_k
blk\.(0)\.attn_output.*=iq5_ks
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_kt
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=iq4_kt
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=iq4_kt
blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt
# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ2_KT.gguf \
IQ2_KT \
192
IQ1_KT
7.583 GiB (2.133 BPW)
Final estimate: PPL = 8.7273 +/- 0.06185
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=iq5_ks
blk\.(0)\.attn_k.*=iq6_k
blk\.(0)\.attn_v.*=iq6_k
blk\.(0)\.attn_output.*=iq5_ks
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_kt
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=iq4_kt
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=iq4_kt
blk\..*\.ffn_down_exps\.weight=iq2_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt
# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ1_KT.gguf \
IQ1_KT \
192
Quick Start
Full GPU Offload with CUDA or Vulkan (for AMD GPUs)
# Compile CUDA backend
cmake -B ./build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_F16=ON
cmake --build ./build --config Release -j $(nproc)
# Compile Vulkan backend
# Experimental: does not yet work with all quant types, needs more testing
# https://github.com/ikawrakow/ik_llama.cpp/discussions/590
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_HIPBLAS=0 -DGGML_VULKAN=1
cmake --build build --config Release -j $(nproc)
# Run Server
./build/bin/llama-server \
--model Qwen3-30B-A3B-Instruct-2507-IQ3_KS.gguf \
--alias ubergarm/Qwen3-30B-A3B-Instruct-2507 \
--ctx-size 32768 \
-ctk q8_0 -ctv q8_0 \
-fa -fmoe \
-ngl 99 \
--parallel 1 \
--threads 1 \
--host 127.0.0.1 \
--port 8080
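Once the server is up, it speaks the usual OpenAI-compatible API; a quick smoke test with curl, assuming the host, port, and alias used above:

# Minimal chat completion request against the local server
curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "ubergarm/Qwen3-30B-A3B-Instruct-2507",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'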
CPU-only Backend
# Compile
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=0 -DGGML_VULKAN=0
cmake --build build --config Release -j $(nproc)
# Run Server
./build/bin/llama-server \
--model Qwen3-30B-A3B-Instruct-2507-IQ3_KS.gguf \
--alias ubergarm/Qwen3-30B-A3B-Instruct-2507 \
--ctx-size 32768 \
-ctk q8_0 -ctv q8_0 \
-fa -fmoe \
-ub 4096 -b 4096 \
--parallel 1 \
--threads 8 \
--host 127.0.0.1 \
--port 8080 \
--no-mmap
imatrix note
I used @eaddario's eaddario-imatrix-corpus-combined-all-medium corpus, converted to text like so:
$ apt-get install duckdb
$ duckdb -ascii -c "SELECT * FROM read_parquet('combined_all_medium.parquet');" > eaddario-imatrix-corpus-combined-all-medium.txt
$ du -h eaddario-imatrix-corpus-combined-all-medium.txt
9.4M eaddario-imatrix-corpus-combined-all-medium.txt
$ sha1sum eaddario-imatrix-corpus-combined-all-medium.txt
4cde1d5401abdc399b22ab9ede82b63684ad6bb4 eaddario-imatrix-corpus-combined-all-medium.txt
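The imatrix .dat itself was then computed against the BF16 GGUF; a minimal sketch with llama-imatrix, assuming the filenames used in the recipes above:

# Compute the importance matrix over the converted corpus
./build/bin/llama-imatrix \
    --model Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
    -f eaddario-imatrix-corpus-combined-all-medium.txt \
    -o imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat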