ik_llama.cpp imatrix Quantizations of Qwen/Qwen3-30B-A3B-Instruct-2507

This quant collection REQUIRES ik_llama.cpp fork to support the ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

NOTE ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc if you want to try it out before downloading my quants.

Some of ik's new quants are supported with Nexesenex/croco.cpp fork of KoboldCPP.

These quants provide best in class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.

Perplexity Chart

These first two are just test quants for baseline perplexity comparison:

  • bf16 56.894 GiB (16.007 BPW)
    • Final estimate: PPL = 7.3594 +/- 0.05170
  • Q8_0 30.247 GiB (8.510 BPW)
    • Final estimate: PPL = 7.3606 +/- 0.05171

IQ5_K 21.324 GiB (5.999 BPW)

Final estimate: PPL = 7.3806 +/- 0.05170

๐Ÿ‘ˆ Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k

# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ5_K.gguf \
    IQ5_K \
    192

IQ4_K 17.878 GiB (5.030 BPW)

Final estimate: PPL = 7.3951 +/- 0.05178

๐Ÿ‘ˆ Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ4_K.gguf \
    IQ4_K \
    192

IQ4_KSS 15.531 GiB (4.370 BPW)

Final estimate: PPL = 7.4392 +/- 0.05225

๐Ÿ‘ˆ Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ4_KSS.gguf \
    IQ4_KSS \
    192

IQ3_K 14.509 GiB (4.082 BPW)

Final estimate: PPL = 7.4991 +/- 0.05269

๐Ÿ‘ˆ Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ3_K.gguf \
    IQ3_K \
    192

IQ3_KS 13.633 GiB (3.836 BPW)

Final estimate: PPL = 7.5512 +/- 0.05307

๐Ÿ‘ˆ Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq4_ks
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_ks

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ3_KS.gguf \
    IQ3_KS \
    192

IQ2_KL 11.516 GiB (3.240 BPW)

Final estimate: PPL = 7.7121 +/- 0.05402

๐Ÿ‘ˆ Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ2_KL.gguf \
    IQ2_KL \
    192

IQ2_KT 9.469 GiB (2.664 BPW)

Final estimate: PPL = 8.0270 +/- 0.05698

๐Ÿ‘ˆ Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]
blk\.(0)\.attn_q.*=iq5_ks
blk\.(0)\.attn_k.*=iq6_k
blk\.(0)\.attn_v.*=iq6_k
blk\.(0)\.attn_output.*=iq5_ks

# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_kt

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=iq4_kt
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=iq4_kt

blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt

# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ2_KT.gguf \
    IQ2_KT \
    192

IQ1_KT 7.583 GiB (2.133 BPW)

Final estimate: PPL = 8.7273 +/- 0.06185

๐Ÿ‘ˆ Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]
blk\.(0)\.attn_q.*=iq5_ks
blk\.(0)\.attn_k.*=iq6_k
blk\.(0)\.attn_v.*=iq6_k
blk\.(0)\.attn_output.*=iq5_ks

# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_kt

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=iq4_kt
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=iq4_kt

blk\..*\.ffn_down_exps\.weight=iq2_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt

# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ1_KT.gguf \
    IQ1_KT \
    192

Quick Start

Full GPU Offload with CUDA or or Vulkan (for AMD GPUs)

# Compile CUDA backend
cmake -B ./build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_F16=ON
cmake --build ./build --config Release -j $(nproc)

# Compile Vulkan backend
# Experimental doesn't work with all quant types, need to test some more
# https://github.com/ikawrakow/ik_llama.cpp/discussions/590
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_HIPBLAS=0 -DGGML_VULKAN=1
cmake --build build --config Release -j $(nproc)

# Run Server
./build/bin/llama-server \
    --model Qwen3-30B-A3B-Instruct-2507-IQ3_KS.gguf \
    --alias ubergarm/Qwen3-30B-A3B-Instruct-2507 \
    --ctx-size 32768 \
    -ctk q8_0 -ctv q8_0 \
    -fa -fmoe \
    -ngl 99 \
    --parallel 1 \
    --threads 1 \
    --host 127.0.0.1 \
    --port 8080

CPU-only Backend

# Compile
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=0 -DGGML_VULKAN=0
cmake --build build --config Release -j $(nproc)

# Run Server
./build/bin/llama-server \
    --model Qwen3-30B-A3B-Instruct-2507-IQ3_KS.gguf \
    --alias ubergarm/Qwen3-30B-A3B-Instruct-2507 \
    --ctx-size 32768 \
    -ctk q8_0 -ctv q8_0 \
    -fa -fmoe \
    -ub 4096 -b 4096 \
    --parallel 1 \
    --threads 8 \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap

imatrix note

I used @eaddario's eaddario-imatrix-corpus-combined-all-medium converted to text like so:

$ apt-get install duckdb
$ duckdb -ascii -c "SELECT * FROM read_parquet('combined_all_medium.parquet');" > eaddario-imatrix-corpus-combined-all-medium.txt
$ du -h eaddario-imatrix-corpus-combined-all-medium.txt
9.4M    eaddario-imatrix-corpus-combined-all-medium.txt
$ sha1sum eaddario-imatrix-corpus-combined-all-medium.txt
4cde1d5401abdc399b22ab9ede82b63684ad6bb4  eaddario-imatrix-corpus-combined-all-medium.txt

References

Downloads last month
616
GGUF
Model size
30.5B params
Architecture
qwen3moe
Hardware compatibility
Log In to view the estimation

2-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF

Quantized
(65)
this model