Llama.cpp hybrid layer quantization of Qwen3-4B by Alibaba

Original model: https://huggingface.co/Qwen/Qwen3-4B

The hybrid quant employs different quantization levels on a per-layer basis to increase the flexibility of trading off performance vs. file size. Fewer parameter bits are used at deep layers and more bits at cortex layers to simultaneously optimize quantized size and model performance. These quants were specifically optimized for the Qwen3 4B edge model to give essentially no performance loss vs. the Q8_0 quant while reducing file size by about 0.6 GB.

The layer quants are as follows:

   LAYER_TYPES='[
   [0 ,"Q8_0"  ],[1 ,"Q5_K_M"],[2 ,"Q5_K_M"],[3 ,"Q5_K_M"],[4 ,"Q5_K_M"],[5 ,"Q5_K_M"],
   [6 ,"Q5_K_M"],[7 ,"Q5_K_M"],[8, "Q5_K_M"],[9, "Q5_K_M"],[10,"Q5_K_M"],[11,"Q5_K_M"],
   [12,"Q6_K"  ],[13,"Q6_K"  ],[14,"Q6_K"  ],[15,"Q6_K"  ],[16,"Q6_K"  ],[17,"Q6_K"  ],
   [18,"Q6_K"  ],[19,"Q6_K"  ],[20,"Q6_K"  ],[21,"Q6_K"  ],[22,"Q6_K"  ],[23,"Q6_K"  ],
   [24,"Q8_0"  ],[25,"Q8_0"  ],[26,"Q8_0"  ],[27,"Q8_0"  ],[28,"Q8_0"  ],[29,"Q8_0"  ],
   [30,"Q8_0"  ],[31,"Q8_0"  ],[32,"Q8_0"  ],[33,"Q8_0"  ],[34,"Q8_0"  ],[35,"Q8_0"  ]
   ]'
   FLAGS="--token-embedding-type Q8_0 --output-tensor-type Q6_K"

These quants were selected based on combined subjective and objective performance evaluations to give both high performance and reduced file size.

Comparison:

| Quant  | Size (bytes) | PPL  | Comment                  |
|--------|--------------|------|--------------------------|
| Q8_0   | 4.3e9        | 13.2 | default embed and output |
| Q8_0_H | 3.6e9        | 13.1 | Q8_0 embed, Q6_K output  |
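
The PPL figures above can be reproduced with llama.cpp's perplexity tool; the exact evaluation text used for this card is not stated, so the corpus file and context size below are placeholders.

```bash
# Sketch: perplexity measurement with llama.cpp. The evaluation corpus used
# for the table above is not specified, so eval.txt is a placeholder.
./llama-perplexity -m Qwen3-4B.Q8_0_H.gguf -f eval.txt -c 2048
```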

Full evals comparing Qwen3 4B Q8_0 and Q8_0_H are also available at https://huggingface.co/spaces/steampunque/benchlm

Download the file from below:

| Link                 | Type   | Size (bytes) | Notes        |
|----------------------|--------|--------------|--------------|
| Qwen3-4B.Q8_0_H.gguf | Q8_0_H | 3.6e9        | Q8_0 quality |
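
As a usage example (not part of the original card), the file can be fetched with the Hugging Face CLI and served with llama.cpp; the repo id matches this card, and the context size and port are arbitrary choices.

```bash
# Example usage (not from the card): fetch the hybrid quant and serve it.
huggingface-cli download steampunque/Qwen3-4B-Hybrid-GGUF \
    Qwen3-4B.Q8_0_H.gguf --local-dir .

# llama.cpp's OpenAI-compatible server; context size and port are arbitrary.
./llama-server -m Qwen3-4B.Q8_0_H.gguf -c 8192 --port 8080
```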

A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
