Llama.cpp hybrid layer quantization of Qwen3-32B by Alibaba

Original model: https://huggingface.co/Qwen/Qwen3-32B

The hybrid quant employs different quantization levels on a per-layer basis to increase flexibility in trading off performance vs. file size. Fewer parameter bits are used at deep layers and more bits at cortex layers to simultaneously optimize quantized size and model performance. These quants were specifically optimized for the Qwen3 32B model to give a size similar to IQ4_XS while meeting or exceeding the performance of the IQ4_XS quant, using K quants in all layers for faster CPU processing on partially offloaded models.

The layer quants are as follows:

   LAYER_TYPES='[
   [0 ,"Q3_K_M"],[1 ,"Q3_K_M"],[2 ,"Q3_K_M"],[3 ,"Q3_K_M"],[4 ,"Q3_K_M"],[5 ,"Q3_K_M"],[6 ,"Q3_K_M"],[7 ,"Q3_K_M"],
   [8 ,"Q3_K_M"],[9 ,"Q3_K_M"],[10,"Q3_K_M"],[11,"Q3_K_M"],[12,"Q3_K_M"],[13,"Q3_K_M"],[14,"Q3_K_M"],[15,"Q3_K_M"],
   [16,"Q3_K_L"],[17,"Q3_K_M"],[18,"Q3_K_L"],[19,"Q3_K_M"],[20,"Q3_K_L"],[21,"Q3_K_M"],[22,"Q3_K_L"],[23,"Q3_K_M"],
   [24,"Q3_K_L"],[25,"Q3_K_L"],[26,"Q3_K_L"],[27,"Q3_K_L"],[28,"Q3_K_L"],[29,"Q3_K_L"],[30,"Q3_K_L"],[31,"Q3_K_L"],
   [32,"Q4_K_S"],[33,"Q3_K_L"],[34,"Q4_K_S"],[35,"Q3_K_L"],[36,"Q4_K_S"],[37,"Q3_K_L"],[38,"Q4_K_S"],[39,"Q3_K_L"],
   [40,"Q4_K_S"],[41,"Q4_K_S"],[42,"Q4_K_S"],[43,"Q4_K_S"],[44,"Q4_K_S"],[45,"Q4_K_S"],[46,"Q4_K_S"],[47,"Q4_K_S"],
   [48,"Q4_K_M"],[49,"Q4_K_S"],[50,"Q4_K_M"],[51,"Q4_K_S"],[52,"Q4_K_M"],[53,"Q4_K_S"],[54,"Q4_K_M"],[55,"Q4_K_S"],
   [56,"Q4_K_M"],[57,"Q4_K_M"],[58,"Q4_K_M"],[59,"Q4_K_M"],[60,"Q4_K_M"],[61,"Q4_K_M"],[62,"Q4_K_M"],[63,"Q4_K_M"]
   ]'
   FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K"

These quants were selected based on combined subjective and objective performance evaluations to give both high performance and reduced file size.

Comparison:

| Quant  | Size   | PPL | Comment                  |
| ------ | ------ | --- | ------------------------ |
| IQ4_XS | 17.9e9 | 7.8 | default embed and output |
| Q4_K_H | 17.9e9 | 7.8 | Q4_K embed, Q6_K output  |
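
The PPL figures above can be checked with the llama-perplexity tool from llama.cpp. The evaluation text below is a placeholder; the exact corpus used for this table is not stated here:

    # Sketch: compare perplexity of the hybrid quant against the IQ4_XS baseline.
    # wiki.test.raw stands in for whatever evaluation text is preferred.
    ./llama-perplexity -m Qwen3-32B.Q4_K_H.gguf -f wiki.test.raw -c 2048
    ./llama-perplexity -m Qwen3-32B.IQ4_XS.gguf -f wiki.test.raw -c 2048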

Full evals for the Q4_K_H quant are available at https://huggingface.co/spaces/steampunque/benchlm

Download the file from below:

| Link                   | Type   | Size (e9 B) | Notes           |
| ---------------------- | ------ | ----------- | --------------- |
| Qwen3-32B.Q4_K_H.gguf  | Q4_K_H | 17.9        | IQ4_XS+ quality |
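
One possible way to fetch and run the file with a partial GPU offload (the case the all-K-quant layer choice targets); the offload count and context size below are illustrative, not recommendations:

    # Sketch: download with the Hugging Face CLI, then serve with llama-server.
    # -ngl 40 offloads 40 of the model's 64 layers to the GPU; tune to available VRAM.
    huggingface-cli download steampunque/Qwen3-32B-Hybrid-GGUF Qwen3-32B.Q4_K_H.gguf --local-dir .
    ./llama-server -m Qwen3-32B.Q4_K_H.gguf -ngl 40 -c 8192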

A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
