Llama.cpp hybrid layer quantization of Qwen3-32B by Alibaba

Original model: https://huggingface.co/Qwen/Qwen3-32B

The hybrid quant employs different quantization levels on a per-layer basis to increase flexibility in trading off performance vs. file size. Fewer parameter bits are used at deep layers and more bits at cortex layers to simultaneously optimize quantized size and model performance. These quants were specifically optimized for the Qwen3 32B model to give a size similar to IQ4_XS while meeting or exceeding the performance of the IQ4_XS quant, using K quants in all layers for faster CPU processing on partially offloaded models.

The layer quants are as follows:

   LAYER_TYPES='[
   [0 ,"Q3_K_M"],[1 ,"Q3_K_M"],[2 ,"Q3_K_M"],[3 ,"Q3_K_M"],[4 ,"Q3_K_M"],[5 ,"Q3_K_M"],[6 ,"Q3_K_M"],[7 ,"Q3_K_M"],
   [8 ,"Q3_K_M"],[9 ,"Q3_K_M"],[10,"Q3_K_M"],[11,"Q3_K_M"],[12,"Q3_K_M"],[13,"Q3_K_M"],[14,"Q3_K_M"],[15,"Q3_K_M"],
   [16,"Q3_K_L"],[17,"Q3_K_M"],[18,"Q3_K_L"],[19,"Q3_K_M"],[20,"Q3_K_L"],[21,"Q3_K_M"],[22,"Q3_K_L"],[23,"Q3_K_M"],
   [24,"Q3_K_L"],[25,"Q3_K_L"],[26,"Q3_K_L"],[27,"Q3_K_L"],[28,"Q3_K_L"],[29,"Q3_K_L"],[30,"Q3_K_L"],[31,"Q3_K_L"],
   [32,"Q4_K_S"],[33,"Q3_K_L"],[34,"Q4_K_S"],[35,"Q3_K_L"],[36,"Q4_K_S"],[37,"Q3_K_L"],[38,"Q4_K_S"],[39,"Q3_K_L"],
   [40,"Q4_K_S"],[41,"Q4_K_S"],[42,"Q4_K_S"],[43,"Q4_K_S"],[44,"Q4_K_S"],[45,"Q4_K_S"],[46,"Q4_K_S"],[47,"Q4_K_S"],
   [48,"Q4_K_M"],[49,"Q4_K_S"],[50,"Q4_K_M"],[51,"Q4_K_S"],[52,"Q4_K_M"],[53,"Q4_K_S"],[54,"Q4_K_M"],[55,"Q4_K_S"],
   [56,"Q4_K_M"],[57,"Q4_K_M"],[58,"Q4_K_M"],[59,"Q4_K_M"],[60,"Q4_K_M"],[61,"Q4_K_M"],[62,"Q4_K_M"],[63,"Q4_K_M"]
   ]'
   FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K"

These quants were selected based on combined subjective and objective performance evaluations to give both high performance and reduced file size.

Comparison:

| Quant  | Size   | PPL | Comment                  |
| ------ | ------ | --- | ------------------------ |
| IQ4_XS | 17.9e9 | 7.8 | default embed and output |
| Q4_K_H | 17.9e9 | 7.8 | Q4_K embed, Q6_K output  |
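
The PPL figures above can be checked with the llama-perplexity tool from llama.cpp. The evaluation text below is a placeholder; the exact corpus used for this table is not stated here:

    # Sketch: compare perplexity of the hybrid quant against the IQ4_XS baseline.
    # wiki.test.raw stands in for whatever evaluation text is preferred.
    ./llama-perplexity -m Qwen3-32B.Q4_K_H.gguf -f wiki.test.raw -c 2048
    ./llama-perplexity -m Qwen3-32B.IQ4_XS.gguf -f wiki.test.raw -c 2048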

Full evals for the Q4_K_H quant are available at https://huggingface.co/spaces/steampunque/benchlm

Download the file from below:

| Link                   | Type   | Size (e9 B) | Notes           |
| ---------------------- | ------ | ----------- | --------------- |
| Qwen3-32B.Q4_K_H.gguf  | Q4_K_H | 17.9        | IQ4_XS+ quality |
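
One possible way to fetch and run the file with a partial GPU offload (the case the all-K-quant layer choice targets); the offload count and context size below are illustrative, not recommendations:

    # Sketch: download with the Hugging Face CLI, then serve with llama-server.
    # -ngl 40 offloads 40 of the model's 64 layers to the GPU; tune to available VRAM.
    huggingface-cli download steampunque/Qwen3-32B-Hybrid-GGUF Qwen3-32B.Q4_K_H.gguf --local-dir .
    ./llama-server -m Qwen3-32B.Q4_K_H.gguf -ngl 40 -c 8192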

A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
