ik_llama.cpp imatrix Quantizations of Qwen/Qwen3-30B-A3B-Thinking-2507

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.

Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCpp.

These quants provide best in class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and the YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.
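
These numbers can be reproduced with ik_llama.cpp's llama-perplexity tool; a minimal sketch, assuming a CUDA build with full offload (the -ngl and --threads values are assumptions, adjust for your hardware):

# Sketch: compute perplexity of a quant over the wikitext-2 test split
./build/bin/llama-perplexity \
    --model Qwen3-30B-A3B-Thinking-2507-IQ5_K.gguf \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    --threads 8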

[Perplexity Chart]

These first three are just test quants for baseline perplexity comparison:

  • bf16 56.894 GiB (16.007 BPW)
    • Final estimate: PPL = 7.3149 +/- 0.05076
  • Q8_0 30.247 GiB (8.510 BPW)
    • Final estimate: PPL = 7.3284 +/- 0.05091
  • Q4_0 16.111 GiB (4.533 BPW)
    • Final estimate: PPL = 7.4534 +/- 0.05151
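
(BPW is simply the file size in bits over the parameter count: e.g. for the bf16, 56.894 GiB * 1024^3 * 8 bits / ~30.5B params comes out to about 16.0 BPW.)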

IQ5_K 21.324 GiB (5.999 BPW)

Final estimate: PPL = 7.3440 +/- 0.05091

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k

# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  # strip the comment lines, then join the rest into one comma-separated list
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ5_K.gguf \
    IQ5_K \
    192 # nthreads
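
A note on ordering: --custom-q rules are matched top to bottom with the first matching regex winning, which is why the specific layer 0 and layer 47 overrides are listed before the catch-all blk\..* patterns. After the sed transform, $custom collapses into a single comma-separated list of pattern=type pairs, roughly (abbreviated):

blk\.(0)\.attn_q.*=q8_0,blk\.(0)\.attn_k.*=q8_0,...,token_embd\.weight=iq6_k,output\.weight=iq6_k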

IQ4_K 17.878 GiB (5.030 BPW)

Final estimate: PPL = 7.3634 +/- 0.05104

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ4_K.gguf \
    IQ4_K \
    192

IQ4_KSS 15.531 GiB (4.370 BPW)

Final estimate: PPL = 7.3861 +/- 0.05128

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ4_KSS.gguf \
    IQ4_KSS \
    192

IQ4_KT 14.438 GiB (4.062 BPW)

Final estimate: PPL = 7.5020 +/- 0.05230

Mostly pure IQ4_KT, meant for full GPU offload, similar to turboderp-org/exllamav3. Check out ArtusDev's HuggingFace page for some excellent EXL3 quants!

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq4_kt
blk\..*\.attn_v.*=iq4_kt
blk\..*\.attn_output.*=iq4_kt

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq4_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kt

# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ4_KT.gguf \
    IQ4_KT \
    192

IQ3_K 14.509 GiB (4.082 BPW)

Final estimate: PPL = 7.4360 +/- 0.05162

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ3_K.gguf \
    IQ3_K \
    192

IQ3_KS 13.633 GiB (3.836 BPW)

Final estimate: PPL = 7.4959 +/- 0.05204

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq4_ks
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_ks

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ3_KS.gguf \
    IQ3_KS \
    192

IQ2_KL 11.516 GiB (3.240 BPW)

Final estimate: PPL = 7.6992 +/- 0.05345

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]

# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0

blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ2_KL.gguf \
    IQ2_KL \
    192

IQ2_KT 9.469 GiB (2.664 BPW)

Final estimate: PPL = 8.0207 +/- 0.05638

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]
blk\.(0)\.attn_q.*=iq5_ks
blk\.(0)\.attn_k.*=iq6_k
blk\.(0)\.attn_v.*=iq6_k
blk\.(0)\.attn_output.*=iq5_ks

# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_kt

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=iq4_kt
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=iq4_kt

blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt

# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ2_KT.gguf \
    IQ2_KT \
    192

IQ1_KT 7.583 GiB (2.133 BPW)

Final estimate: PPL = 8.8341 +/- 0.06231

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# 48 Repeating Layers [0-47]
blk\.(0)\.attn_q.*=iq5_ks
blk\.(0)\.attn_k.*=iq6_k
blk\.(0)\.attn_v.*=iq6_k
blk\.(0)\.attn_output.*=iq5_ks

# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_kt

# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=iq4_kt
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=iq4_kt

blk\..*\.ffn_down_exps\.weight=iq2_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt

# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ1_KT.gguf \
    IQ1_KT \
    192

Quick Start
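
If you haven't built ik_llama.cpp yet, clone it first (build commands for each backend follow):

# One-time setup: fetch the ik_llama.cpp fork
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp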

Full GPU Offload with CUDA or Vulkan (for AMD GPUs)

# Compile CUDA backend
cmake -B ./build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_F16=ON
cmake --build ./build --config Release -j $(nproc)

# Compile Vulkan backend
# Experimental: doesn't work with all quant types yet, needs more testing
# https://github.com/ikawrakow/ik_llama.cpp/discussions/590
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_HIPBLAS=0 -DGGML_VULKAN=1
cmake --build build --config Release -j $(nproc)

# Run Server
./build/bin/llama-server \
    --model Qwen3-30B-A3B-Thinking-2507-IQ3_KS.gguf \
    --alias ubergarm/Qwen3-30B-A3B-Thinking-2507 \
    --ctx-size 32768 \
    -ctk q8_0 -ctv q8_0 \
    -fa -fmoe \
    -ngl 99 \
    --parallel 1 \
    --threads 1 \
    --host 127.0.0.1 \
    --port 8080
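
Once the server is up, a quick smoke test against its OpenAI-compatible chat completions endpoint (host, port, and alias taken from the flags above):

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ubergarm/Qwen3-30B-A3B-Thinking-2507",
    "messages": [{"role": "user", "content": "Briefly, what is 7 * 6?"}],
    "max_tokens": 256
  }'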

CPU-only Backend

# Compile
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=0 -DGGML_VULKAN=0
cmake --build build --config Release -j $(nproc)

# Run Server
./build/bin/llama-server \
    --model Qwen3-30B-A3B-Thinking-2507-IQ3_KS.gguf \
    --alias ubergarm/Qwen3-30B-A3B-Thinking-2507 \
    --ctx-size 32768 \
    -ctk q8_0 -ctv q8_0 \
    -fa -fmoe \
    -ub 4096 -b 4096 \
    --parallel 1 \
    --threads 8 \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap
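
A few notes on these flags: -ub 4096 -b 4096 raise the batch sizes to speed up prompt processing on CPU at the cost of some extra RAM, --no-mmap loads the whole model into RAM up front instead of memory-mapping it, and --threads should roughly match your physical core count.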

imatrix note

I used @eaddario's eaddario-imatrix-corpus-combined-all-medium dataset, converted to text like so:

$ apt-get install duckdb
$ duckdb -ascii -c "SELECT * FROM read_parquet('combined_all_medium.parquet');" > eaddario-imatrix-corpus-combined-all-medium.txt
$ du -h eaddario-imatrix-corpus-combined-all-medium.txt
9.4M    eaddario-imatrix-corpus-combined-all-medium.txt
$ sha1sum eaddario-imatrix-corpus-combined-all-medium.txt
4cde1d5401abdc399b22ab9ede82b63684ad6bb4  eaddario-imatrix-corpus-combined-all-medium.txt
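
The resulting text file is then fed to the imatrix tool against the full bf16 model; a sketch of that step (binary name and flags per ik_llama.cpp conventions, paths shortened):

./build/bin/llama-imatrix \
    --model Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
    -f eaddario-imatrix-corpus-combined-all-medium.txt \
    -o imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat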
