Permission to access the Unquantized versions of the QAT weights for Gemma-3

#4
by Joseph717171 - opened

Can we have access to the unquantized QAT weights (if they exist; perhaps I'm off-base), so we can perform our own quantizations as well? Also, with llama.cpp's GGUF quantization scheme, it's always best to compute an importance matrix (imatrix) to use during the quantization process. This helps the quantized weights stay closer to their half- or full-precision counterparts, which is especially crucial for Round-To-Nearest (RTN) quants like GGUF's. A quick look at any of the GGUF quantization repositories hosted by @bartowski will support this. Either way, great work! It's because of efforts like these that we can enjoy larger models on devices with limited VRAM, which is always appreciated! 🙏
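
To make the imatrix point concrete, here's a tiny self-contained Python toy (my own sketch, not llama.cpp's actual algorithm): it compares a plain round-to-nearest 4-bit block quantizer against one that picks each block's scale by minimizing an importance-weighted squared error, which is roughly the role the imatrix's activation statistics play.

```python
import numpy as np

def rtn_dequantize(block, scale):
    """Symmetric 4-bit round-to-nearest quantize + dequantize of one block."""
    return np.clip(np.round(block / scale), -8, 7) * scale

def plain_scale(block):
    """Naive RTN scale: map the largest-magnitude weight onto the 4-bit range."""
    return np.max(np.abs(block)) / 7.0 + 1e-12

def weighted_scale(block, importance, n_grid=64):
    """Pick the scale minimizing importance-weighted squared error,
    a rough stand-in for what imatrix-guided quantization optimizes."""
    base = plain_scale(block)
    candidates = np.concatenate(([base], np.linspace(0.7 * base, 1.3 * base, n_grid)))
    errors = [np.sum(importance * (block - rtn_dequantize(block, s)) ** 2)
              for s in candidates]
    return candidates[int(np.argmin(errors))]

rng = np.random.default_rng(0)
block = rng.normal(size=32)                   # one 32-weight block
importance = rng.uniform(0.1, 10.0, size=32)  # toy per-weight activation statistics

for label, s in (("plain RTN", plain_scale(block)),
                 ("importance-weighted", weighted_scale(block, importance))):
    err = np.sum(importance * (block - rtn_dequantize(block, s)) ** 2)
    print(f"{label:20s} weighted reconstruction error = {err:.4f}")
```

The importance-weighted scale can never do worse than the naive one on the weighted error, since the naive scale is included among the candidates; in the real workflow, the imatrix gathers those importance values from activations over a calibration text set.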

On a different note: why did you choose F16 for the embeddings over Q8_0? If the embeddings were quantized to Q8_0, the output tensors could also be quantized to Q8_0, potentially making the model more accurate (since output weights are more sensitive to quantization than embeddings). This is where an imatrix used during the quantization process really shines and helps mitigate more of the quantization-induced noise, which inevitably degrades the quantized model’s accuracy.
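
For a rough sense of what's at stake size-wise, here is some back-of-the-envelope Python using approximate Gemma-3 12B embedding dimensions (the vocab and hidden sizes below are illustrative only; check the model config for exact values) and the nominal bits-per-weight of the relevant GGUF tensor types:

```python
# Memory of the token-embedding table alone at different GGUF tensor types.
vocab, hidden = 262_144, 3_840   # approximate Gemma-3 12B shape, for illustration only
n_params = vocab * hidden

bits_per_weight = {
    "F16":  16.0,
    "Q8_0": 8.5,      # 32-weight blocks: fp16 scale + 32 int8 = 34 bytes
    "Q6_K": 6.5625,   # 256-weight super-blocks stored in 210 bytes
    "Q4_0": 4.5,      # 32-weight blocks: fp16 scale + 16 packed bytes
}

for name, bpw in bits_per_weight.items():
    gib = n_params * bpw / 8 / 2**30
    print(f"{name:5s} embedding table ≈ {gib:5.2f} GiB")
```

At that vocabulary size the F16 table alone is roughly 1.9 GiB, so moving it to Q8_0 or Q6_K frees a substantial amount of memory with (per the reports below) little to no noticeable quality cost.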

Some good honest work using Mean-Squared Deviation (MSD) has been done to compare BF16 vs. quantized models, with different quantization levels for embeddings, output tensors, and other weights. The data shows that output tensors are more sensitive to quantization than embeddings, and Feed-Forward Network (FFN) weights are the most sensitive of all. I should note, there’s a pull request in progress for llama.cpp (quantize: Handle user-defined quantization levels for additional tensors #12511) that will allow models to be quantized with user-specified per-tensor quantization schemes. For GGUF quants, it’s highly recommended to always quantize using a trained imatrix—try it yourself and see the difference it makes.

Information on llama.cpp’s imatrix and how to compute it:
Imatrix README
Dataset used by Bartowski for imatrix training
GitHub repo for the PR

msd_comparison_wide.png

Zoomed-In:
msd_comparison_zoomed.png

Mean-Squared Deviation of error: tensor sensitivity to quantization-induced noise across different standard and custom GGUF quantizations (lower is better). This chart is included as a visual reference for the points above about some tensor types being more sensitive to quantization than others:

Mean-Squared_Deviation_Testing_the_effect_of_quantizing_different_tensor_types-2.png
(Note: Llama-3.2-3B's output tensor appears to be missing from the chart because it was never present in the collected data.)
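
For anyone who wants to reproduce this kind of comparison, a minimal sketch of the metric itself is below. It's just the definition of mean-squared deviation applied to a quantize-then-dequantize round trip; the 4-bit quantizer here is a toy stand-in, not llama.cpp's actual Q4_0 kernel, and the random tensors are placeholders for the real BF16 weights.

```python
import numpy as np

def msd(reference, reconstructed):
    """Mean-squared deviation between a reference tensor and its reconstruction."""
    return float(np.mean((reference - reconstructed) ** 2))

def fake_quant_4bit(x, block_size=32):
    """Toy symmetric 4-bit block quantize/dequantize, for illustration only."""
    blocks = x.reshape(-1, block_size)
    scale = np.max(np.abs(blocks), axis=1, keepdims=True) / 7.0 + 1e-12
    return (np.clip(np.round(blocks / scale), -8, 7) * scale).reshape(-1)

rng = np.random.default_rng(0)
# Placeholder tensors standing in for different tensor types of a real model.
tensors = {
    "token_embd": rng.normal(0.0, 0.02, size=4096 * 32),
    "ffn_down":   rng.normal(0.0, 0.05, size=4096 * 32),
    "output":     rng.normal(0.0, 0.03, size=4096 * 32),
}

for name, t in tensors.items():
    print(f"{name:10s} MSD after 4-bit round trip: {msd(t, fake_quant_4bit(t)):.3e}")
```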

Joseph717171 changed discussion title from Permission to access the unquantized version of the QAT weights to Permission to access the Unquantized versions of the QAT weights for Gemma-3

Hello, I ran a few tests and had the impression that the q4_K_M version originally made available by Ollama worked better than this qat-q4_0 provided by Google. To me, the graph you posted makes sense. Will we get a qat-q4_K_M version?
I'm using an Nvidia L4, Open WebUI, and Ollama with Gemma 3 12B q4_K_M in production, and it works very well, but I'm limited to only 4 simultaneous requests and would like to increase that limit.
Can I get the Gemma 3 27B model in 4 bits to run with good performance (around 20 tokens/sec) on this L4, using llama.cpp or vLLM, with multiple concurrent requests? I confess I'm having difficulty getting Gemma 3 to work in vLLM and llama.cpp; it seems to me they are not yet fully optimized for this model.

If Google lets us have the QAT weights to quantize from, then yes. 🤔

Yes, you're very right about that. These models are also unnecessarily heavy to run, since the embeddings table is fp16. In https://huggingface.co/stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small they requantized it to q6_k, and it runs much faster with no noticeable drop in quality at all. This might be a good alternative until Google provides more quants or access to the QAT model.
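
If you want to verify what a given GGUF actually stores (for example, whether the embedding table is still F16 or has been requantized to Q6_K), the gguf Python package can read the tensor metadata. A rough sketch, assuming the GGUFReader API from the gguf pip package and an illustrative file name:

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("gemma-3-12b-it-qat-q4_0.gguf")  # illustrative path
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum value (F16, Q4_0, Q6_K, ...)
    dims = [int(d) for d in tensor.shape]
    print(f"{tensor.name:40s} {tensor.tensor_type.name:6s} {dims}")
```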

Yes, please. We would also appreciate more clarity on the QAT method ("per-channel int4, per-block int4, and switched fp8"). Is this something published on Kaggle?

Does Q4_0 share similarities with GPTQ at a group size of 32? If we could losslessly convert to other formats found online, that would be awesome.
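
For what it's worth, Q4_0 is also a group-wise format as far as I understand it: weights are stored in blocks of 32 with a single fp16 scale per block and symmetric 4-bit values, so the storage layout is comparable to symmetric GPTQ with a group size of 32, although GPTQ's error-compensating rounding differs from plain round-to-nearest. A simplified NumPy sketch of the encode/decode (details differ from the actual ggml implementation, so treat this as an approximation):

```python
import numpy as np

def q4_0_encode(x):
    """Q4_0-style encode (simplified): 32-weight blocks, one scale per block
    (fp16 in the real format), symmetric 4-bit values stored as nibbles."""
    blocks = x.reshape(-1, 32).astype(np.float32)
    # Scale chosen so the block's largest-magnitude weight maps to -8,
    # mirroring ggml's divide-by-negative-eight convention.
    extreme = blocks[np.arange(len(blocks)), np.argmax(np.abs(blocks), axis=1)]
    d = np.where(extreme == 0, 1.0, extreme / -8.0).astype(np.float16)
    q = np.clip(np.round(blocks / d[:, None].astype(np.float32)) + 8, 0, 15).astype(np.uint8)
    return d, q

def q4_0_decode(d, q):
    """Reconstruct fp32 weights from per-block scales and 4-bit values."""
    return ((q.astype(np.float32) - 8.0) * d[:, None].astype(np.float32)).reshape(-1)

x = np.random.default_rng(0).normal(size=64).astype(np.float32)
d, q = q4_0_encode(x)
print("max abs reconstruction error:", float(np.max(np.abs(x - q4_0_decode(d, q)))))
```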

Google listened: https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized

@bartowski I wonder if quantizing from these leads to higher quality in your quants as well.

Interesting... worth a shot, I suppose.

I did a quick write-up on some initial results quantizing the QAT model: https://github.com/ikawrakow/ik_llama.cpp/discussions/334
