README.md · ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF at main

quantized_by: ubergarm
pipeline_tag: text-generation
base_model: Qwen/Qwen3-235B-A22B-Instruct-2507
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507/blob/main/LICENSE
base_model_relation: quantized
tags:
  - imatrix
  - conversational
  - ik_llama.cpp

`ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-235B-A22B-Instruct-2507

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.

Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCpp.

These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and the YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw. These first two are just test quants for baseline perplexity comparison:

- bf16 437.989 GiB (16.003 BPW)
  - Final estimate: PPL = 4.3079 +/- 0.02544
- Q8_0 232.769 GiB (8.505 BPW)
  - Final estimate: PPL = 4.3139 +/- 0.02550
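
To put the numbers in this collection side by side, here is a small sketch that prints each quant's perplexity increase over the bf16 baseline. The sizes and PPL values are copied from this page; the `ppl_delta` helper name is hypothetical and only there for illustration.

```bash
#!/usr/bin/env bash
# Percent increase in perplexity of a quant over a baseline.
# Usage: ppl_delta BASELINE_PPL QUANT_PPL   (hypothetical helper)
ppl_delta() {
  awk -v base="$1" -v q="$2" 'BEGIN { printf "%.2f", (q / base - 1) * 100 }'
}

baseline=4.3079   # bf16 PPL from this page

# Columns: name, size in GiB, PPL (all copied from this page)
while read -r name gib ppl; do
  printf '%-12s %8s GiB  PPL %s  (+%s%% vs bf16)\n' \
    "$name" "$gib" "$ppl" "$(ppl_delta "$baseline" "$ppl")"
done <<'EOF'
Q8_0 232.769 4.3139
IQ5_K 161.722 4.3351
IQ4_K 134.183 4.3668
pure-IQ4_KS 116.994 4.4156
IQ4_KSS 115.085 4.4017
IQ3_K 106.644 4.4561
IQ3_KS 101.308 4.4915
IQ2_KL 81.866 4.7912
EOF
```

Going by the figures on this page, `IQ4_KSS` comes out both smaller and lower in perplexity than `pure-IQ4_KS`. Keep in mind the reported +/- error bars when reading small differences.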

`IQ5_K` 161.722 GiB (5.909 BPW)

Final estimate: PPL = 4.3351 +/- 0.02566

👈 Secret Recipe

#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k

# Token Embedding
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/imatrix-Qwen3-235B-A22B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-IQ5_K.gguf \
    IQ5_K \
    192
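
The `grep`/`sed` step in the recipe drops the comment lines and joins the remaining rules with commas before they are handed to `--custom-q`. A minimal sketch with a shortened rule list shows the resulting string. It assumes GNU `sed` (the `-z` option is a GNU extension), and the variable name `flat` is only for illustration.

```bash
#!/usr/bin/env bash
custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq6_k
"

# 1. grep -v '^#' removes the comment lines.
# 2. sed -z reads the whole input as one record, so every run of
#    newlines can be collapsed into a single comma.
# 3. The last two substitutions trim the comma left at the end and
#    at the start by the blank first and last lines.
flat=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

echo "$flat"
# prints: blk\..*\.attn_q.*=iq6_k,blk\..*\.attn_k.*=q8_0,blk\..*\.ffn_down_exps\.weight=iq6_k
```

Echoing the flattened string like this before a long quantize run is a cheap way to catch a typo in a rule.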

`IQ4_K` 134.183 GiB (4.903 BPW)

Final estimate: PPL = 4.3668 +/- 0.02594

👈 Secret Recipe

#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k

# Token Embedding
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/imatrix-Qwen3-235B-A22B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-IQ4_K.gguf \
    IQ4_K \
    192

`pure-IQ4_KS` 116.994 GiB (4.275 BPW)

Final estimate: PPL = 4.4156 +/- 0.02624

👈 Secret Recipe

#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_k.*=iq4_ks
blk\..*\.attn_q.*=iq4_ks
blk\..*\.attn_v.*=iq4_ks
blk\..*\.attn_output.*=iq4_ks

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_ks

# Token Embedding
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/imatrix-Qwen3-235B-A22B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-eaddario-imat-pure-IQ4_KS.gguf \
    IQ4_KS \
    192

`IQ4_KSS` 115.085 GiB (4.205 BPW)

Final estimate: PPL = 4.4017 +/- 0.02614

This one is a little funky just for fun. Seems smort!

👈 Secret Recipe

#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\.(0|1|2|3)\.ffn_down_exps\.weight=iq5_ks
blk\.(0|1|2|3)\.ffn_(gate|up)_exps\.weight=iq4_ks
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# Token Embedding
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/imatrix-Qwen3-235B-A22B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-IQ4_KSS.gguf \
    IQ4_KSS \
    192
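
The routed-expert rules in this recipe list the layer-specific patterns for layers 0-3 ahead of the catch-all patterns. The sketch below shows which type each tensor would get if the first matching rule wins. That ordering behaviour is an assumption suggested by how the recipe is laid out, not a verified description of `llama-quantize` internals, and `resolve_type` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Ordered rules copied from the "Routed Experts" part of the IQ4_KSS recipe.
rules='blk\.(0|1|2|3)\.ffn_down_exps\.weight=iq5_ks
blk\.(0|1|2|3)\.ffn_(gate|up)_exps\.weight=iq4_ks
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss'

# Print the type of the first rule whose regex matches the tensor name.
# ASSUMPTION: rules are tried top to bottom and the first match wins.
resolve_type() {
  local name=$1 rule pattern qtype
  while IFS= read -r rule; do
    pattern=${rule%=*}   # everything before the last '='
    qtype=${rule##*=}    # everything after the last '='
    if printf '%s\n' "$name" | grep -Eq -- "$pattern"; then
      printf '%s\n' "$qtype"
      return 0
    fi
  done <<EOF
$rules
EOF
  return 1   # no rule matched
}

for t in blk.2.ffn_down_exps.weight blk.2.ffn_up_exps.weight \
         blk.42.ffn_down_exps.weight blk.42.ffn_gate_exps.weight; do
  printf '%-30s -> %s\n' "$t" "$(resolve_type "$t")"
done
```

Under that assumption, the expert tensors in layers 0-3 resolve to `iq5_ks` (down) and `iq4_ks` (gate/up), while every other layer falls through to `iq4_ks` (down) and `iq4_kss` (gate/up).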

`IQ3_K` 106.644 GiB (3.897 BPW)

Final estimate: PPL = 4.4561 +/- 0.02657

👈 Secret Recipe

#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k

# Token Embedding
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/imatrix-Qwen3-235B-A22B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-IQ3_K.gguf \
    IQ3_K \
    192

`IQ3_KS` 101.308 GiB (3.702 BPW)

Final estimate: PPL = 4.4915 +/- 0.02685

Another funky smort one!

👈 Secret Recipe

#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\.(0|1|2|3)\.ffn_down_exps\.weight=iq5_ks
blk\.(0|1|2|3)\.ffn_(gate|up)_exps\.weight=iq4_ks
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

# Token Embedding
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/imatrix-Qwen3-235B-A22B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-IQ3_KS.gguf \
    IQ3_KS \
    192

`IQ2_KL` 81.866 GiB (2.991 BPW)

Final estimate: PPL = 4.7912 +/- 0.02910

👈 Secret Recipe

#!/usr/bin/env bash

# Repeating Layers [0-93]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

# Token Embedding
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/imatrix-Qwen3-235B-A22B-Instruct-2507-BF16.dat \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-BF16-00001-of-00010.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-IQ2_KL.gguf \
    IQ2_KL \
    192

Quick Start

This example is for hybrid inferencing on a single CUDA GPU plus CPU/RAM. Check the ik_llama.cpp discussions or my other quants for more examples, e.g. multi-GPU.

./build/bin/llama-server \
  --model /models/IQ5_K/Qwen3-235B-A22B-Instruct-IQ5_K-00001-of-00004.gguf \
  --alias ubergarm/Qwen3-235B-A22B-Instruct-2507 \
  -fa -fmoe \
  -ctk q8_0 -ctv q8_0 \
  -c 32768 \
  -ngl 99 \
  -ot "blk\.[0-9]\.ffn.*=CUDA0" \
  -ot "blk.*\.ffn.*=CPU" \
  --threads 16 \
  -ub 4096 -b 4096 \
  --host 127.0.0.1 \
  --port 8080
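
The two `-ot` patterns in the command split the FFN tensors between the GPU and the CPU. Here is a rough sketch of that split, assuming the overrides are tried in the order given and the first match wins (an assumption, not confirmed by this page). The tensor names follow the patterns used in the recipes, the layer range 0-93 comes from the "Repeating Layers [0-93]" note, and `placement` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Decide where a tensor lands under the two -ot overrides from the command.
# ASSUMPTION: overrides are tried in order and the first match wins.
placement() {
  if printf '%s\n' "$1" | grep -Eq 'blk\.[0-9]\.ffn.*'; then
    echo CUDA0
  elif printf '%s\n' "$1" | grep -Eq 'blk.*\.ffn.*'; then
    echo CPU
  else
    echo default   # matched by neither override
  fi
}

gpu=0
cpu=0
layer=0
# Walk layers 0-93 and count where one FFN tensor per layer would go.
while [ "$layer" -le 93 ]; do
  case $(placement "blk.${layer}.ffn_up_exps.weight") in
    CUDA0) gpu=$((gpu + 1)) ;;
    CPU)   cpu=$((cpu + 1)) ;;
  esac
  layer=$((layer + 1))
done

echo "FFN layers on CUDA0: $gpu, on CPU: $cpu"
# prints: FFN layers on CUDA0: 10, on CPU: 84
```

`[0-9]` matches a single digit only, so just layers 0-9 go to `CUDA0`. If you have VRAM to spare, widening the first pattern (for example `blk\.(1?[0-9])\.ffn.*` for layers 0-19) would move more layers onto the GPU; how far you can go depends on your card and context size.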
