ik_llama.cpp imatrix Quantizations of google/gemma-3-27b-it-qat-q4_0-unquantized

This quant collection REQUIRES the ik_llama.cpp fork, which supports advanced non-linear SotA quants. Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!
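If you have not built the fork yet, a minimal CUDA build might look like the following (a sketch assuming Linux with the CUDA toolkit installed; double-check the cmake flag names against the fork's README for your hardware):

git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)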

These quants provide best-in-class perplexity for their memory footprint.
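To grab a single quant straight from Hugging Face, huggingface-cli works fine (a sketch; the local directory is just an example and the exact filename should be checked against the repo's file listing):

pip install -U "huggingface_hub[cli]"
huggingface-cli download ubergarm/gemma-3-27b-it-qat-GGUF \
    gemma-3-27b-it-qat-iq4_ks.gguf \
    --local-dir ./models/ubergarm/gemma-3-27b-it-qat-GGUF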

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and the YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community here and on r/LocalLLaMA for tips and tricks helping each other run all the fun new models!

Excited to share and learn together. Thanks!

Quant Collection

So far these are my best recipes, offering the lowest perplexity per GiB.

Check out this speed and quality comparison with benchmark graphs and discussion.
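If you want comparable speed numbers on your own hardware, llama-bench is a simple starting point (a sketch; the model path mirrors the server example below and the prompt/generation lengths are arbitrary):

./build/bin/llama-bench \
    -m /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-iq4_ks.gguf \
    -fa 1 \
    -ngl 99 \
    -p 512 -n 128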

ubergarm/gemma-3-27B-it-qat-iq4_ks.gguf

Best Quality

  • 32k context in 23704MiB VRAM
  • 16k context in 19488MiB VRAM
  • 8k context in 17380MiB VRAM
  • Only 13126MiB VRAM with -rtr -ot attn=CPU -nkvo
  • Could use a q4_0 KV cache for even lower VRAM usage!
14.099 GiB (4.484 BPW)
f32:     373 tensors
q4_0:     62 tensors  blk.*.attn_v.weight
q8_0:      1 tensor
iq4_ks:  372 tensors
Final estimate: PPL = 8.1755 +/- 0.06296

ubergarm/gemma-3-27B-it-qat-mix-iq3_k.gguf

Smallest with Good Quality

  • 32k context in 22306MiB VRAM
  • 16k context in 18090MiB VRAM
  • 8k context in 15982MiB VRAM
  • Only 11960MiB VRAM with -rtr -ot attn=CPU -nkvo
  • Could use a q4_0 KV cache for even lower VRAM usage!
12.733 GiB (4.050 BPW)
f32:     373 tensors
q4_0:     62 tensors  blk.*.attn_v.weight
q8_0:      1 tensor   token_embd.weight
iq3_k:   124 tensors  ffn_(gate|up).weight
iq4_ks:  248 tensors  ffn_down.weight
Final estimate: PPL = 8.2367 +/- 0.06329
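The "Final estimate" lines above are standard llama-perplexity output. To reproduce or compare numbers yourself, a run might look roughly like this (a sketch assuming the usual wiki.test.raw test file; a different corpus or context length will shift the absolute values):

./build/bin/llama-perplexity \
    --model /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-iq4_ks.gguf \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    --threads 4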

Quick Start

ik_llama.cpp API server for GPU inferencing

# This example targets a single 24GB VRAM GPU.
# -fa enables flash attention, -ctk/-ctv q8_0 quantize the KV cache,
# and -ngl 99 offloads all layers to the GPU.
./build/bin/llama-server \
    --alias ubergarm/gemma-3-27b-it-qat-iq4_ks \
    --model /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-iq4_ks.gguf \
    -ctk q8_0 -ctv q8_0 \
    -fa \
    -amb 512 \
    -fmoe \
    -c 32768 \
    -ub 512 \
    -ngl 99 \
    --threads 4 \
    --host 127.0.0.1 \
    --port 8080
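Once the server is up, you can sanity-check it with an OpenAI-style chat completion request against the built-in /v1/chat/completions endpoint (a quick sketch; adjust host/port if you changed them above):

curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ubergarm/gemma-3-27b-it-qat-iq4_ks",
          "messages": [{"role": "user", "content": "Hello! Who are you?"}],
          "max_tokens": 128
        }'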

If you want more context and/or lower VRAM usage, you can try:

  • Smaller KV cache quantization: -ctk q4_0 -ctv q4_0
  • Runtime repack for CPU inferencing, override attn tensors to CPU, and disable KV offload: -rtr -ot attn=CPU -nkvo (see the example below).
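For example, a lower-VRAM variant of the server command above might look like this (a sketch combining those options; it trades generation speed for VRAM by keeping the attention tensors and KV cache on the CPU side):

./build/bin/llama-server \
    --alias ubergarm/gemma-3-27b-it-qat-iq4_ks \
    --model /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-iq4_ks.gguf \
    -ctk q4_0 -ctv q4_0 \
    -fa \
    -rtr -ot attn=CPU -nkvo \
    -c 32768 \
    -ngl 99 \
    --threads 4 \
    --host 127.0.0.1 \
    --port 8080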
