Llama.cpp hybrid layer quantization of gemma-3-12b-it by Google
Original model: https://huggingface.co/google/gemma-3-12b-it
The hybrid quant employs different quantization levels on a per-layer basis to increase flexibility in trading off performance vs. file size. Fewer parameter bits are used at deep layers and more bits at cortex layers to simultaneously optimize quantized size and model performance. This quant was designed to approximately match IQ4_XS size and performance while using all K-quants for faster CPU processing when partially offloaded. For this file the layer quants are as follows:
```
LAYER_TYPES='[
[0 ,"Q4_K_M"],[1 ,"Q4_K_S"],[2 ,"Q3_K_L"],[3 ,"Q3_K_L"],[4 ,"Q3_K_L"],[5 ,"Q3_K_L"],[6 ,"Q3_K_L"],[7 ,"Q3_K_L"],
[8 ,"Q3_K_L"],[9 ,"Q3_K_L"],[10,"Q3_K_L"],[11,"Q3_K_L"],[12,"Q4_K_S"],[13,"Q3_K_L"],[14,"Q4_K_S"],[15,"Q3_K_L"],
[16,"Q4_K_S"],[17,"Q3_K_L"],[18,"Q4_K_S"],[19,"Q3_K_L"],[20,"Q4_K_S"],[21,"Q3_K_L"],[22,"Q4_K_S"],[23,"Q3_K_L"],
[24,"Q4_K_S"],[25,"Q4_K_S"],[26,"Q4_K_S"],[27,"Q4_K_S"],[28,"Q4_K_S"],[29,"Q4_K_S"],[30,"Q4_K_S"],[31,"Q4_K_S"],
[32,"Q4_K_M"],[33,"Q4_K_S"],[34,"Q4_K_M"],[35,"Q4_K_S"],[36,"Q4_K_M"],[37,"Q4_K_S"],[38,"Q4_K_S"],[39,"Q4_K_M"],
[40,"Q4_K_M"],[41,"Q4_K_M"],[42,"Q4_K_M"],[43,"Q4_K_M"],[44,"Q4_K_M"],[45,"Q4_K_M"],[46,"Q4_K_M"],[47,"Q5_K_M"]
]'
FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K"
```
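As a rough sketch, these settings could be applied with a llama-quantize build patched to read per-layer quant types from the LAYER_TYPES environment variable (stock llama.cpp only honors the FLAGS shown above, not LAYER_TYPES); the input file name and base quant type below are placeholders, not the exact recipe used for this upload:

```bash
# Sketch only: assumes the LAYER_TYPES and FLAGS assignments above have
# already been evaluated in the current shell, and that llama-quantize
# has been patched to read LAYER_TYPES for per-layer overrides.
export LAYER_TYPES
# input:  unquantized GGUF conversion of the original model (placeholder name)
# output: the hybrid Q4_K_H quant
# final argument: base quant type, overridden per layer by LAYER_TYPES
./llama-quantize $FLAGS \
    gemma-3-12b-it.BF16.gguf \
    gemma-3-12b-it.Q4_K_H.gguf \
    Q4_K_S
```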
These quants were optimized for high reasoning and knowledge performance across a range of test prompts. This model appears to lose knowledge and reasoning very quickly when layers are quantized below 4 bits, so using any smaller quant with it is not recommended. The QAT (quantization aware training) version of the model was also tested and found to work significantly worse on the same set of test prompts, so it is not uploaded. Most likely the QAT munged the weights enough to effectively pre-lobotomize the model, so good performance through the use of hybrid quants becomes impossible.
Comparison:
Quant | Size (bytes) | PPL | Comment |
---|---|---|---|
IQ4_XS | 6.61e9 | 9.29 | default embed and output |
Q4_K_H | 6.67e9 | 9.11 | Q4_K embed Q6_K output |
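The perplexity figures above can in principle be checked with the stock llama-perplexity tool; the evaluation text and settings below are assumptions, since the table does not specify them:

```bash
# Sketch: measure perplexity of the hybrid quant.
# wiki.test.raw and full GPU offload are illustrative choices only.
./llama-perplexity \
    -m gemma-3-12b-it.Q4_K_H.gguf \
    -f wiki.test.raw \
    -ngl 99
```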
Usage:
gemma-3 12b is a vision-capable model. It can be used together with its multimedia projector layers to process image and text inputs and generate text outputs. The mmproj file is made available in this repository. To test vision mode, follow the docs in the mtmd README in the tools directory of the source tree: https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/README.md
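As a quick sanity check of vision mode, an invocation along these lines should work (the image path and prompt are placeholders; see the mtmd README above for all options):

```bash
# Sketch: basic vision test with llama-mtmd-cli from llama.cpp.
# Model and mmproj file names match this repository; image/prompt are placeholders.
./llama-mtmd-cli \
    -m gemma-3-12b-it.Q4_K_H.gguf \
    --mmproj gemma-3-12b-it.mmproj.gguf \
    --image ./test.jpg \
    -p "Describe this image."
```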
The model also uses sliding window attention (SWA). llama.cpp b5554 or later is recommended for support of the SWA mode. If the --swa-full flag is used, the old method of keeping all KV memory and masking out everything outside the SWA window is used instead. When using SWA, prompt cache capability is lost but the available context is greatly increased (around 5.5x larger). A KV cache of ~55k tokens is available on a 12 GB VRAM GPU with SWA and a gemma 3 1b speculator loaded, or ~72k tokens with no speculator loaded. There is a problem when using the q8_0 KV cache format where some heavy computations are pushed to the CPU, making prompt processing and token generation unusably slow. This does not happen with f16 KV, so it is recommended to stay with f16 KV until/if this problem gets resolved. Related discussion: https://github.com/ggml-org/llama.cpp/issues/13747.
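For reference, a llama-server launch consistent with the setup described above might look like the following; the draft model file name, context size, and offload counts are illustrative assumptions rather than measured settings:

```bash
# Sketch: 12 GB GPU, SWA (default on b5554+), f16 KV cache,
# optional gemma 3 1b draft model for speculative decoding.
# Tune -c and -ngl/-ngld for your hardware; drop -md/-ngld to run without a speculator.
./llama-server \
    -m  gemma-3-12b-it.Q4_K_H.gguf \
    -md gemma-3-1b-it.Q4_K_M.gguf \
    -ngl 99 -ngld 99 \
    -c 55000 \
    -ctk f16 -ctv f16 \
    --port 8080
```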
Download the files below:
Link | Type | Size/e9 B | Notes |
---|---|---|---|
gemma-3-12b-it.Q4_K_H.gguf | Q4_K_H | 6.67e9 B | ~IQ4_XS size |
gemma-3-12b-it.mmproj.gguf | mmproj | 0.85e9 B | multimedia projector |
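For command-line downloads, something like the following should work; `<this-repo-id>` is a placeholder for the Hugging Face id of this repository:

```bash
# Sketch: fetch the quant and the mmproj file with huggingface-cli.
huggingface-cli download <this-repo-id> \
    gemma-3-12b-it.Q4_K_H.gguf gemma-3-12b-it.mmproj.gguf \
    --local-dir .
```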
A discussion thread about the hybrid layer quant approach can be found on the llama.cpp git repository.