Llama.cpp hybrid layer quantization of Qwen3-32B by Alibaba
Original model: https://huggingface.co/Qwen/Qwen3-32B
The hybrid quant employs different quantization levels on a per-layer basis to increase flexibility in trading off performance vs. file size. Fewer parameter bits are used at deep layers and more bits at cortex layers to simultaneously optimize quantized size and model performance. These quants were specifically optimized for the Qwen3 32B model to give a size similar to IQ4_XS while meeting or exceeding the performance of the IQ4_XS quant, using K quants in all layers for faster CPU processing on partially offloaded models.
The layer quants are as follows:
```
LAYER_TYPES='[
[0 ,"Q3_K_M"],[1 ,"Q3_K_M"],[2 ,"Q3_K_M"],[3 ,"Q3_K_M"],[4 ,"Q3_K_M"],[5 ,"Q3_K_M"],[6 ,"Q3_K_M"],[7 ,"Q3_K_M"],
[8 ,"Q3_K_M"],[9 ,"Q3_K_M"],[10,"Q3_K_M"],[11,"Q3_K_M"],[12,"Q3_K_M"],[13,"Q3_K_M"],[14,"Q3_K_M"],[15,"Q3_K_M"],
[16,"Q3_K_L"],[17,"Q3_K_M"],[18,"Q3_K_L"],[19,"Q3_K_M"],[20,"Q3_K_L"],[21,"Q3_K_M"],[22,"Q3_K_L"],[23,"Q3_K_M"],
[24,"Q3_K_L"],[25,"Q3_K_L"],[26,"Q3_K_L"],[27,"Q3_K_L"],[28,"Q3_K_L"],[29,"Q3_K_L"],[30,"Q3_K_L"],[31,"Q3_K_L"],
[32,"Q4_K_S"],[33,"Q3_K_L"],[34,"Q4_K_S"],[35,"Q3_K_L"],[36,"Q4_K_S"],[37,"Q3_K_L"],[38,"Q4_K_S"],[39,"Q3_K_L"],
[40,"Q4_K_S"],[41,"Q4_K_S"],[42,"Q4_K_S"],[43,"Q4_K_S"],[44,"Q4_K_S"],[45,"Q4_K_S"],[46,"Q4_K_S"],[47,"Q4_K_S"],
[48,"Q4_K_M"],[49,"Q4_K_S"],[50,"Q4_K_M"],[51,"Q4_K_S"],[52,"Q4_K_M"],[53,"Q4_K_S"],[54,"Q4_K_M"],[55,"Q4_K_S"],
[56,"Q4_K_M"],[57,"Q4_K_M"],[58,"Q4_K_M"],[59,"Q4_K_M"],[60,"Q4_K_M"],[61,"Q4_K_M"],[62,"Q4_K_M"],[63,"Q4_K_M"]
]'
FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K"
```
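As a reference for how a map like this might be applied, here is a minimal sketch. It assumes a llama-quantize build patched to honor a per-layer type map from the LAYER_TYPES environment variable (the stock llama-quantize has no per-layer option; the --token-embedding-type and --output-tensor-type flags are standard, and the BF16 source filename is an assumption):

```
#!/bin/bash
# Sketch only (assumption): a patched llama-quantize that reads the
# LAYER_TYPES environment variable; stock llama-quantize applies the single
# base type given below to all layers. FLAGS options are standard flags.
# LAYER_TYPES and FLAGS are set as shown above.
export LAYER_TYPES

# Q4_K_M acts as the base/fallback type; the patch overrides it per layer.
./llama-quantize $FLAGS Qwen3-32B-BF16.gguf Qwen3-32B.Q4_K_H.gguf Q4_K_M
```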
These quants were selected based on combined subjective and objective performance evaluations to give both high performance and reduced file size.
Comparison:
Quant | Size (bytes) | PPL | Comment
---|---|---|---
IQ4_XS | 17.9e9 | 7.8 | default embed and output
Q4_K_H | 17.9e9 | 7.8 | Q4_K embed, Q6_K output
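Perplexity numbers like those above can be measured with llama.cpp's llama-perplexity tool; a minimal sketch follows (the evaluation corpus and IQ4_XS filename are assumptions, since the table does not state them):

```
# Compare perplexity of the two quants on the same text file.
# wiki.test.raw is an assumed stand-in corpus, not necessarily the one
# used for the table above.
./llama-perplexity -m Qwen3-32B.Q4_K_H.gguf -f wiki.test.raw -ngl 99
./llama-perplexity -m Qwen3-32B.IQ4_XS.gguf -f wiki.test.raw -ngl 99
```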
Full evals for the Q4_K_H quant are available at https://huggingface.co/spaces/steampunque/benchlm
Download the file below:
Link | Type | Size (e9 B) | Notes
---|---|---|---
Qwen3-32B.Q4_K_H.gguf | Q4_K_H | 17.9 | IQ4_XS+ quality
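A minimal sketch for downloading and running the quant with standard tools (huggingface-cli from the huggingface_hub package and llama-cli from llama.cpp):

```
# Fetch the GGUF from this repo.
huggingface-cli download steampunque/Qwen3-32B-Hybrid-GGUF \
  Qwen3-32B.Q4_K_H.gguf --local-dir .

# Run interactively; lower -ngl for partial offload, where the
# all-K-quant layers speed up the CPU side.
./llama-cli -m Qwen3-32B.Q4_K_H.gguf -ngl 99 -cnv
```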
A discussion thread about the hybrid layer quant approach can be found on the llama.cpp GitHub repository.