Llama.cpp hybrid layer quantization of gemma-3-27b-it by Google

Original model: https://huggingface.co/google/gemma-3-27b-it

The hybrid quant employs different quantization levels on a per-layer basis to increase flexibility in trading off performance against file size. Fewer parameter bits are used at deep layers and more bits at cortex layers to simultaneously optimize quantized size and model performance. This quant was designed to be smaller than Q4_K_M with better stability and performance, while using all K-quants for fast CPU processing when partially offloaded. For this file the layer quants are as follows (an example invocation is sketched after the listing):

   LAYER_TYPES='[
   [0 ,"Q5_K_M"],[1 ,"Q4_K_S"],[2 ,"Q3_K_L"],[3 ,"Q3_K_L"],[4 ,"Q3_K_L"],[5 ,"Q3_K_L"],[6 ,"Q3_K_L"],[7 ,"Q3_K_L"],
   [8 ,"Q4_K_S"],[9 ,"Q3_K_L"],[10,"Q4_K_S"],[11,"Q3_K_L"],[12,"Q4_K_S"],[13,"Q3_K_L"],[14,"Q4_K_S"],[15,"Q3_K_L"],
   [16,"Q3_K_L"],[17,"Q4_K_S"],[18,"Q3_K_L"],[19,"Q4_K_S"],[20,"Q3_K_L"],[21,"Q4_K_S"],[22,"Q3_K_L"],[23,"Q4_K_S"],
   [24,"Q4_K_S"],[25,"Q3_K_L"],[26,"Q4_K_S"],[27,"Q3_K_L"],[28,"Q4_K_S"],[29,"Q3_K_L"],[30,"Q4_K_S"],[31,"Q3_K_L"],
   [32,"Q4_K_S"],[33,"Q4_K_S"],[34,"Q4_K_S"],[35,"Q4_K_S"],[36,"Q4_K_S"],[37,"Q4_K_S"],[38,"Q4_K_S"],[39,"Q4_K_S"],
   [40,"Q4_K_S"],[41,"Q4_K_S"],[42,"Q4_K_S"],[43,"Q4_K_S"],[44,"Q4_K_M"],[45,"Q4_K_M"],[46,"Q4_K_M"],[47,"Q4_K_M"],
   [48,"Q4_K_M"],[49,"Q4_K_M"],[50,"Q4_K_M"],[51,"Q4_K_M"],[52,"Q4_K_M"],[53,"Q4_K_M"],[54,"Q4_K_M"],[55,"Q4_K_M"],
   [56,"Q4_K_M"],[57,"Q4_K_M"],[58,"Q4_K_M"],[59,"Q5_K_S"],[60,"Q5_K_M"],[61,"Q6_K"]
   ]'
   FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K"

The Gemma 3 27B model has known issues with high dynamic range layer activations (pushing well outside the F16 range). This makes the model very difficult to quantize without degrading performance. Extensive experimentation showed that the model could not be reduced to the range of IQ4_XS bit efficiency without noticeably degrading performance. Instead, a size between IQ4_XS and Q4_K_M was found to achieve very strong performance. The quants were optimized for high reasoning, knowledge, and code performance with full generation stability across a range of test prompts. This model loses knowledge, reasoning, and stability very quickly when layers are quantized below 4 bits, so using any smaller quant with it is not recommended. The QAT (quantization aware training) version of the model was also tested and found to work noticeably worse on the same set of test prompts, so it is not uploaded. Most likely the QAT munged the weights enough to effectively pre-lobotomize the model, making good performance through hybrid layer quants impossible. Note that in tests, the Q4_K_M quant was found to be unstable (no convergence on some prompts) while the Q4_K_H quant is fully stable across a range of test prompts.

Comparison:

| Quant  | Size     | PPL  | Comment                            |
| ------ | -------- | ---- | ---------------------------------- |
| IQ4_XS | 14.9e9 B | 8.06 |                                    |
| Q4_K_H | 15.8e9 B | 8.00 | Q6_K embed and output, stable      |
| Q4_K_M | 16.5e9 B | 8.01 | Q4_K embed, Q6_K output, unstable  |
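
The perplexity figures can in principle be reproduced with the stock llama-perplexity tool. A minimal sketch, assuming a local GGUF and a text corpus (the exact corpus and settings used for the table above are not specified here):

    # Sketch: measure perplexity of the hybrid quant on a text file.
    # wiki.test.raw is a placeholder corpus; -ngl 99 offloads all layers to GPU.
    ./llama-perplexity -m gemma-3-27b-it.Q4_K_H.gguf \
        -f wiki.test.raw -ngl 99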

Usage:

llama.cpp updated the Gemma 3 layer norm computations in b5577 and made another update in b5585 for a possible underflow correction on CUDA. It is recommended to use version b5585 and above with Gemma 3 27B. Note the layer norms still may not be up to handling the wide dynamic range activations well, since the straightforward Q4_K_M quant is still found to be unstable on some test prompts at >b5585.

Gemma 3 27B is a vision-capable model. It can be used together with its multimedia projector layers to process image and text inputs and generate text outputs. The mmproj file is made available in this repository. To test vision mode, follow the docs in the mtmd README in the tools directory of the source tree: https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/README.md .
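
A minimal sketch of a vision-mode run, assuming the llama-mtmd-cli tool built from tools/mtmd and a local test image (see the mtmd README above for authoritative usage):

    # Sketch: describe an image using the model plus its multimedia projector.
    # test.jpg is a placeholder image path.
    ./llama-mtmd-cli -m gemma-3-27b-it.Q4_K_H.gguf \
        --mmproj gemma-3-27b-it.mmproj.gguf \
        --image test.jpg \
        -p "Describe this image."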

The model also uses sliding window attention (SWA). Use of llama.cpp b5554 and above is recommended for support of the SWA mode. If the --swa-full flag is used, the old method of keeping all KV memory and masking out everything outside the SWA window is used. When using SWA, prompt cache capability is lost but the available context is greatly increased (around 5.5x bigger). There is a problem when using the q8_0 KV cache format where some heavy computations are pushed to the CPU and prompt processing and token generation become unusably slow. This does not happen with f16 KV, so it is recommended to stay with f16 KV until/if this problem gets resolved. Related discussion: https://github.com/ggml-org/llama.cpp/issues/13747.
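
As a sketch of the recommendation above (SWA left at its default, f16 KV cache), assuming a recent llama-server build; the context size and GPU layer count are placeholders:

    # Sketch: serve with default SWA handling and f16 KV cache.
    # Add --swa-full only if the old full-KV masking behavior is wanted
    # (smaller usable context, but prompt caching is retained).
    ./llama-server -m gemma-3-27b-it.Q4_K_H.gguf \
        -c 32768 -ngl 99 \
        -ctk f16 -ctv f16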

Benchmarks:

A full set of benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm

gemma-3-27b-it compares most closely with Mistral-Small-3.1-24B-Instruct-2503, available here: https://huggingface.co/steampunque/Mistral-Small-3.1-24B-Instruct-2503-Hybrid-GGUF . A short summary of some key evals comparing the two models is given here for convenience:

| model      | gemma-3-27b-it | Mistral-Small-3.1-24B-Instruct-2503 |
| ---------- | -------------- | ----------------------------------- |
| quant      | Q4_K_H         | Q4_K_H                              |
| alignment  | strict         | permissive                          |
| TEST       |                |                                     |
| Winogrande | 0.748          | 0.784                               |
| Lambada    | 0.742          | 0.798                               |
| Hellaswag  | 0.802          | 0.899                               |
| BoolQ      | 0.701          | 0.646                               |
| Jeopardy   | 0.830          | 0.740                               |
| GSM8K      | 0.964          | 0.940                               |
| Apple      | 0.850          | 0.820                               |
| Humaneval  | 0.890          | 0.853                               |

Download the files from below:

| Link                       | Type   | Size/e9 B | Notes                    |
| -------------------------- | ------ | --------- | ------------------------ |
| gemma-3-27b-it.Q4_K_H.gguf | Q4_K_H | 15.8      | 0.7G smaller than Q4_K_M |
| gemma-3-27b-it.mmproj.gguf | mmproj | 0.86      | multimedia projector     |
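
For example, the files can be fetched with the Hugging Face CLI (assuming the repository id steampunque/gemma-3-27b-it-Hybrid-GGUF):

    # Sketch: download the quantized model and the multimedia projector.
    huggingface-cli download steampunque/gemma-3-27b-it-Hybrid-GGUF \
        gemma-3-27b-it.Q4_K_H.gguf gemma-3-27b-it.mmproj.gguf --local-dir .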

A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
