ik_llama.cpp imatrix Quantizations of Qwen/Qwen3-30B-A3B-Instruct-2507
This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!
NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.
Some of ik's new quants are also supported by the Nexesenex/croco.cpp fork of KoboldCpp.
These quants provide best in class perplexity for the given memory footprint.
Big Thanks
Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!
Quant Collection
Perplexity computed against wiki.test.raw.
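For reproducibility, these numbers come from the usual llama-perplexity run; a minimal sketch of the command, assuming a local copy of wiki.test.raw (adjust the model path and thread count for your rig):

# Measure perplexity over wiki.test.raw
./build/bin/llama-perplexity \
    --model Qwen3-30B-A3B-Instruct-2507-IQ5_K.gguf \
    -f wiki.test.raw \
    --threads 16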
These first two are just test quants for baseline perplexity comparison:
bf16
56.894 GiB (16.007 BPW)
Final estimate: PPL = 7.3594 +/- 0.05170
Q8_0
30.247 GiB (8.510 BPW)
Final estimate: PPL = 7.3606 +/- 0.05171
IQ5_K
21.324 GiB (5.999 BPW)
Final estimate: PPL = 7.3806 +/- 0.05170
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k
# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ5_K.gguf \
IQ5_K \
192
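To sanity-check what the grep/sed pipeline actually passes to --custom-q, you can print the collapsed rule string before quantizing; a trivial check:

# Should print one line of comma-separated tensor-regex=quant rules
echo "$custom"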
IQ4_K
17.878 GiB (5.030 BPW)
Final estimate: PPL = 7.3951 +/- 0.05178
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ4_K.gguf \
IQ4_K \
192
IQ4_KSS
15.531 GiB (4.370 BPW)
Final estimate: PPL = 7.4392 +/- 0.05225
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ4_KSS.gguf \
IQ4_KSS \
192
IQ3_K
14.509 GiB (4.082 BPW)
Final estimate: PPL = 7.4991 +/- 0.05269
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ3_K.gguf \
IQ3_K \
192
IQ3_KS
13.633 GiB (3.836 BPW)
Final estimate: PPL = 7.5512 +/- 0.05307
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq4_ks
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_ks
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ3_KS.gguf \
IQ3_KS \
192
IQ2_KL
11.516 GiB (3.240 BPW)
Final estimate: PPL = 7.7121 +/- 0.05402
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=q8_0
blk\.(0)\.attn_k.*=q8_0
blk\.(0)\.attn_v.*=q8_0
blk\.(0)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_k
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=q8_0
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ2_KL.gguf \
IQ2_KL \
192
IQ2_KT
9.469 GiB (2.664 BPW)
Final estimate: PPL = 8.0270 +/- 0.05698
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=iq5_ks
blk\.(0)\.attn_k.*=iq6_k
blk\.(0)\.attn_v.*=iq6_k
blk\.(0)\.attn_output.*=iq5_ks
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_kt
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=iq4_kt
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=iq4_kt
blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt
# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ2_KT.gguf \
IQ2_KT \
192
IQ1_KT
7.583 GiB (2.133 BPW)
Final estimate: PPL = 8.7273 +/- 0.06185
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 48 Repeating Layers [0-47]
# Attention
blk\.(0)\.attn_q.*=iq5_ks
blk\.(0)\.attn_k.*=iq6_k
blk\.(0)\.attn_v.*=iq6_k
blk\.(0)\.attn_output.*=iq5_ks
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_kt
# Routed Experts
blk\.(0|47)\.ffn_down_exps\.weight=iq4_kt
blk\.(0|47)\.ffn_(gate|up)_exps\.weight=iq4_kt
blk\..*\.ffn_down_exps\.weight=iq2_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt
# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/Qwen3-30B-A3B-Instruct-2507-GGUF/Qwen3-30B-A3B-Instruct-2507-IQ1_KT.gguf \
IQ1_KT \
192
Quick Start
Full GPU Offload with CUDA or Vulkan (for AMD GPUs)
# Compile CUDA backend
cmake -B ./build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_F16=ON
cmake --build ./build --config Release -j $(nproc)
# Compile Vulkan backend
# Experimental: does not yet work with all quant types, needs more testing
# https://github.com/ikawrakow/ik_llama.cpp/discussions/590
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_HIPBLAS=0 -DGGML_VULKAN=1
cmake --build build --config Release -j $(nproc)
# Run Server
./build/bin/llama-server \
--model Qwen3-30B-A3B-Instruct-2507-IQ3_KS.gguf \
--alias ubergarm/Qwen3-30B-A3B-Instruct-2507 \
--ctx-size 32768 \
-ctk q8_0 -ctv q8_0 \
-fa -fmoe \
-ngl 99 \
--parallel 1 \
--threads 1 \
--host 127.0.0.1 \
--port 8080
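Once the server is up, it speaks the usual OpenAI-compatible API; a quick smoke test with curl, assuming the host, port, and alias used above:

# Minimal chat completion request against the local server
curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "ubergarm/Qwen3-30B-A3B-Instruct-2507",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'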
CPU-only Backend
# Compile
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=0 -DGGML_VULKAN=0
cmake --build build --config Release -j $(nproc)
# Run Server
./build/bin/llama-server \
--model Qwen3-30B-A3B-Instruct-2507-IQ3_KS.gguf \
--alias ubergarm/Qwen3-30B-A3B-Instruct-2507 \
--ctx-size 32768 \
-ctk q8_0 -ctv q8_0 \
-fa -fmoe \
-ub 4096 -b 4096 \
--parallel 1 \
--threads 8 \
--host 127.0.0.1 \
--port 8080 \
--no-mmap
imatrix note
I used @eaddario's eaddario-imatrix-corpus-combined-all-medium corpus, converted to text like so:
$ apt-get install duckdb
$ duckdb -ascii -c "SELECT * FROM read_parquet('combined_all_medium.parquet');" > eaddario-imatrix-corpus-combined-all-medium.txt
$ du -h eaddario-imatrix-corpus-combined-all-medium.txt
9.4M eaddario-imatrix-corpus-combined-all-medium.txt
$ sha1sum eaddario-imatrix-corpus-combined-all-medium.txt
4cde1d5401abdc399b22ab9ede82b63684ad6bb4 eaddario-imatrix-corpus-combined-all-medium.txt
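The imatrix .dat itself was then computed against the BF16 GGUF; a minimal sketch with llama-imatrix, assuming the filenames used in the recipes above:

# Compute the importance matrix over the converted corpus
./build/bin/llama-imatrix \
    --model Qwen3-30B-A3B-Instruct-2507-BF16-00001-of-00002.gguf \
    -f eaddario-imatrix-corpus-combined-all-medium.txt \
    -o imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Instruct-2507-BF16.dat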