Permission to access the Unquantized versions of the QAT weights for Gemma-3

#4
by Joseph717171 - opened

Can we have access to the unquantized QAT weights (if they exist; perhaps I'm off-base), so we can perform our own quantizations as well? Also, with llama.cpp's GGUF quantization scheme, it's always best to compute an importance matrix (imatrix) to use during the quantization process. This helps the quantized weights stay closer to their half- or full-precision counterparts, which is especially crucial for Round-To-Nearest (RTN) quants like GGUF's. A quick look at any of the GGUF quantization repositories hosted by @bartowski will support this. Either way, great work! It's because of efforts like these that we can enjoy larger models on devices with limited VRAM, which is always appreciated! 🙏
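
To make the imatrix point concrete, here's a tiny self-contained Python toy (my own sketch, not llama.cpp's actual algorithm): it compares a plain round-to-nearest 4-bit block quantizer against one that picks each block's scale by minimizing an importance-weighted squared error, which is roughly the role the imatrix's activation statistics play.

```python
import numpy as np

def rtn_dequantize(block, scale):
    """Symmetric 4-bit round-to-nearest quantize + dequantize of one block."""
    return np.clip(np.round(block / scale), -8, 7) * scale

def plain_scale(block):
    """Naive RTN scale: map the largest-magnitude weight onto the 4-bit range."""
    return np.max(np.abs(block)) / 7.0 + 1e-12

def weighted_scale(block, importance, n_grid=64):
    """Pick the scale minimizing importance-weighted squared error,
    a rough stand-in for what imatrix-guided quantization optimizes."""
    base = plain_scale(block)
    candidates = np.concatenate(([base], np.linspace(0.7 * base, 1.3 * base, n_grid)))
    errors = [np.sum(importance * (block - rtn_dequantize(block, s)) ** 2)
              for s in candidates]
    return candidates[int(np.argmin(errors))]

rng = np.random.default_rng(0)
block = rng.normal(size=32)                   # one 32-weight block
importance = rng.uniform(0.1, 10.0, size=32)  # toy per-weight activation statistics

for label, s in (("plain RTN", plain_scale(block)),
                 ("importance-weighted", weighted_scale(block, importance))):
    err = np.sum(importance * (block - rtn_dequantize(block, s)) ** 2)
    print(f"{label:20s} weighted reconstruction error = {err:.4f}")
```

The importance-weighted scale can never do worse than the naive one on the weighted error, since the naive scale is included among the candidates; in the real workflow, the imatrix gathers those importance values from activations over a calibration text set.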

On a different note: why did you choose F16 for the embeddings over Q8_0? If the embeddings were quantized to Q8_0, the output tensors could also be quantized to Q8_0, potentially making the model more accurate (since output weights are more sensitive to quantization than embeddings). This is where an imatrix used during the quantization process really shines and helps mitigate more of the quantization-induced noise, which inevitably degrades the quantized model’s accuracy.
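
For a rough sense of what's at stake size-wise, here is some back-of-the-envelope Python using approximate Gemma-3 12B embedding dimensions (the vocab and hidden sizes below are illustrative only; check the model config for exact values) and the nominal bits-per-weight of the relevant GGUF tensor types:

```python
# Memory of the token-embedding table alone at different GGUF tensor types.
vocab, hidden = 262_144, 3_840   # approximate Gemma-3 12B shape, for illustration only
n_params = vocab * hidden

bits_per_weight = {
    "F16":  16.0,
    "Q8_0": 8.5,      # 32-weight blocks: fp16 scale + 32 int8 = 34 bytes
    "Q6_K": 6.5625,   # 256-weight super-blocks stored in 210 bytes
    "Q4_0": 4.5,      # 32-weight blocks: fp16 scale + 16 packed bytes
}

for name, bpw in bits_per_weight.items():
    gib = n_params * bpw / 8 / 2**30
    print(f"{name:5s} embedding table ≈ {gib:5.2f} GiB")
```

At that vocabulary size the F16 table alone is roughly 1.9 GiB, so moving it to Q8_0 or Q6_K frees a substantial amount of memory with (per the reports below) little to no noticeable quality cost.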

Some good honest work using Mean-Squared Deviation (MSD) has been done to compare BF16 vs. quantized models, with different quantization levels for embeddings, output tensors, and other weights. The data shows that output tensors are more sensitive to quantization than embeddings, and Feed-Forward Network (FFN) weights are the most sensitive of all. I should note, there’s a pull request in progress for llama.cpp (quantize: Handle user-defined quantization levels for additional tensors #12511) that will allow models to be quantized with user-specified per-tensor quantization schemes. For GGUF quants, it’s highly recommended to always quantize using a trained imatrix—try it yourself and see the difference it makes.

Information on llama.cpp’s imatrix and how to compute it:
Imatrix README
Dataset used by Bartowski for imatrix training
GitHub repo for the PR

msd_comparison_wide.png

Zoomed-In:
msd_comparison_zoomed.png

Mean-Squared Deviation of error: tensor sensitivity to quantization-induced noise across different standard and custom GGUF quantizations (lower is better). This chart is included as a visual reference for the points above about some tensor types being more sensitive to quantization than others:

Mean-Squared_Deviation_Testing_the_effect_of_quantizing_different_tensor_types-2.png
(Note: Llama-3.2-3B's output tensor appears to be missing from the chart because it was never present in the collected data.)
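
For anyone who wants to reproduce this kind of comparison, a minimal sketch of the metric itself is below. It's just the definition of mean-squared deviation applied to a quantize-then-dequantize round trip; the 4-bit quantizer here is a toy stand-in, not llama.cpp's actual Q4_0 kernel, and the random tensors are placeholders for the real BF16 weights.

```python
import numpy as np

def msd(reference, reconstructed):
    """Mean-squared deviation between a reference tensor and its reconstruction."""
    return float(np.mean((reference - reconstructed) ** 2))

def fake_quant_4bit(x, block_size=32):
    """Toy symmetric 4-bit block quantize/dequantize, for illustration only."""
    blocks = x.reshape(-1, block_size)
    scale = np.max(np.abs(blocks), axis=1, keepdims=True) / 7.0 + 1e-12
    return (np.clip(np.round(blocks / scale), -8, 7) * scale).reshape(-1)

rng = np.random.default_rng(0)
# Placeholder tensors standing in for different tensor types of a real model.
tensors = {
    "token_embd": rng.normal(0.0, 0.02, size=4096 * 32),
    "ffn_down":   rng.normal(0.0, 0.05, size=4096 * 32),
    "output":     rng.normal(0.0, 0.03, size=4096 * 32),
}

for name, t in tensors.items():
    print(f"{name:10s} MSD after 4-bit round trip: {msd(t, fake_quant_4bit(t)):.3e}")
```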

Joseph717171 changed discussion title from Permission to access the unquantized version of the QAT weights to Permission to access the Unquantized versions of the QAT weights for Gemma-3

Hello, I ran a few tests and had the impression that the q4_K_M version originally made available by Ollama worked better than this qat-q4_0 provided by Google. To me, the graph you posted makes sense. Will we get a qat-q4_K_M version?
I'm using an Nvidia L4, Open WebUI, and Ollama with Gemma 3 12B q4_K_M in production, and it works very well, but I'm limited to only 4 simultaneous requests and would like to increase that limit.
Can I get the Gemma 3 27B model in 4 bits to run with good performance (around 20 tokens/sec) on this L4, using llama.cpp or vLLM, with multiple concurrent requests? I confess I'm having difficulty getting Gemma 3 to work in vLLM and llama.cpp; it seems to me they are not yet fully optimized for this model.

If Google lets us have the QAT weights to quantize from, then yes. 🤔

Yes, you're very right about that. These models are also unnecessarily heavy to run, since the embeddings table is fp16. In https://huggingface.co/stduhpf/google-gemma-3-12b-it-qat-q4_0-gguf-small they requantized it to q6_k, and it runs much faster with no noticeable drop in quality at all. This might be a good alternative until Google provides more quants or access to the QAT model.
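
If you want to verify what a given GGUF actually stores (for example, whether the embedding table is still F16 or has been requantized to Q6_K), the gguf Python package can read the tensor metadata. A rough sketch, assuming the GGUFReader API from the gguf pip package and an illustrative file name:

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("gemma-3-12b-it-qat-q4_0.gguf")  # illustrative path
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum value (F16, Q4_0, Q6_K, ...)
    dims = [int(d) for d in tensor.shape]
    print(f"{tensor.name:40s} {tensor.tensor_type.name:6s} {dims}")
```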

Yes, please. We would also appreciate more clarity on the QAT method ("per-channel int4, per-block int4, and switched fp8"). Is this something published on Kaggle?

Does Q4_0 share similarities with GPTQ at a group size of 32? If we could losslessly convert to other formats found online, that would be awesome.
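
For what it's worth, Q4_0 is also a group-wise format as far as I understand it: weights are stored in blocks of 32 with a single fp16 scale per block and symmetric 4-bit values, so the storage layout is comparable to symmetric GPTQ with a group size of 32, although GPTQ's error-compensating rounding differs from plain round-to-nearest. A simplified NumPy sketch of the encode/decode (details differ from the actual ggml implementation, so treat this as an approximation):

```python
import numpy as np

def q4_0_encode(x):
    """Q4_0-style encode (simplified): 32-weight blocks, one scale per block
    (fp16 in the real format), symmetric 4-bit values stored as nibbles."""
    blocks = x.reshape(-1, 32).astype(np.float32)
    # Scale chosen so the block's largest-magnitude weight maps to -8,
    # mirroring ggml's divide-by-negative-eight convention.
    extreme = blocks[np.arange(len(blocks)), np.argmax(np.abs(blocks), axis=1)]
    d = np.where(extreme == 0, 1.0, extreme / -8.0).astype(np.float16)
    q = np.clip(np.round(blocks / d[:, None].astype(np.float32)) + 8, 0, 15).astype(np.uint8)
    return d, q

def q4_0_decode(d, q):
    """Reconstruct fp32 weights from per-block scales and 4-bit values."""
    return ((q.astype(np.float32) - 8.0) * d[:, None].astype(np.float32)).reshape(-1)

x = np.random.default_rng(0).normal(size=64).astype(np.float32)
d, q = q4_0_encode(x)
print("max abs reconstruction error:", float(np.max(np.abs(x - q4_0_decode(d, q)))))
```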

Google listened: https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized

@bartowski I wonder if quantizing from these leads to higher quality in your quants as well.

Interesting... worth a shot, I suppose.

I did a quick write-up on some initial results quantizing the QAT model: https://github.com/ikawrakow/ik_llama.cpp/discussions/334
