# ik_llama.cpp imatrix Quantizations of google/gemma-3-27b-it-qat-q4_0-unquantized
This quant collection REQUIRES the ik_llama.cpp fork, which supports advanced non-linear SotA quants. Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

These quants provide best-in-class perplexity for the given memory footprint.
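If you haven't built the fork yet, here is a minimal sketch of the usual CMake flow for a CUDA build; the exact flags are an assumption here, so double-check the ik_llama.cpp README for your platform:

```bash
# Assumed build steps for a CUDA build of ik_llama.cpp; verify flags against the fork's README
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
```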
## Big Thanks
Shout out to Wendell and the Level1Techs crew, the community forums, and the YouTube channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community here and on r/LocalLLaMA for tips and tricks helping each other run all the fun new models!

Excited to share and learn together. Thanks!
## Quant Collection
So far these are my best recipes, offering the lowest perplexity per GiB. Check out the speed and quality comparison benchmarks, graphs, and discussion.
### ubergarm/gemma-3-27B-it-qat-iq4_ks.gguf

Best Quality

- 32k context in 23704MiB VRAM
- 16k context in 19488MiB VRAM
- 8k context in 17380MiB VRAM
- Only 13126MiB VRAM with `-rtr -ot attn=CPU -nkvo`
- Could go `q4_0` KV cache for even lower VRAM usage!

14.099 GiB (4.484 BPW)

- `f32`: 373 tensors
- `q4_0`: 62 tensors `blk.*.attn_v.weight`
- `q8_0`: 1 tensor
- `iq4_ks`: 372 tensors

Final estimate: PPL = 8.1755 +/- 0.06296
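To grab a single quant, a hedged sketch with huggingface-cli is below; the repo id and the exact filename inside the repo are assumptions, so adjust them to match the repository's file listing:

```bash
# Assumed repo id and filename; verify against the repository's file listing
huggingface-cli download ubergarm/gemma-3-27b-it-qat-GGUF \
  gemma-3-27b-it-qat-iq4_ks.gguf \
  --local-dir ./models/ubergarm/gemma-3-27b-it-qat-GGUF
```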
### ubergarm/gemma-3-27B-it-qat-mix-iq3_k.gguf

Smallest with Good Quality

- 32k context in 22306MiB VRAM
- 16k context in 18090MiB VRAM
- 8k context in 15982MiB VRAM
- Only 11960MiB VRAM with `-rtr -ot attn=CPU -nkvo`
- Could go `q4_0` KV cache for even lower VRAM usage!

12.733 GiB (4.050 BPW)

- `f32`: 373 tensors
- `q4_0`: 62 tensors `blk.*.attn_v.weight`
- `q8_0`: 1 tensor `token_embd.weight`
- `iq3_k`: 124 tensors `ffn_(gate|up).weight`
- `iq4_ks`: 248 tensors `ffn_down.weight`

Final estimate: PPL = 8.2367 +/- 0.06329
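The "Final estimate" lines above are in the format printed by the llama-perplexity tool. A minimal sketch for running your own measurement is below; the corpus and settings are assumptions (wikitext-2-raw's `wiki.test.raw` is the usual choice) and may differ from what produced the numbers above:

```bash
# Assumed corpus and settings; the reported numbers may have used different parameters
./build/bin/llama-perplexity \
  --model /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-iq4_ks.gguf \
  -f wiki.test.raw \
  -fa \
  -ngl 99 \
  --threads 4
```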
## Quick Start

ik_llama.cpp API server for GPU inferencing:
```bash
# This example for 24GB VRAM
./build/bin/llama-server \
    --alias ubergarm/gemma-3-27b-it-qat-mix-iq3_k.gguf \
    --model /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-iq4_ks.gguf \
    -ctk q8_0 -ctv q8_0 \
    -fa \
    -amb 512 \
    -fmoe \
    -c 32768 \
    -ub 512 \
    -ngl 99 \
    --threads 4 \
    --host 127.0.0.1 \
    --port 8080
```
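Once the server is up, you can smoke-test it over the OpenAI-compatible chat completions endpoint that llama-server exposes; the model name below just echoes the `--alias` above, and the payload itself is an illustrative assumption:

```bash
# Simple smoke test against the OpenAI-compatible endpoint
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ubergarm/gemma-3-27b-it-qat-mix-iq3_k.gguf",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```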
If you want more context and/or less VRAM usage, you can try the following (see the combined example below):

- Smaller KV cache quantization: `-ctk q4_0 -ctv q4_0`
- Runtime repack for CPU inferencing, override attn tensors to CPU, and disable KV offload: `-rtr -ot attn=CPU -nkvo`
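Here is a sketch of the same server command with both options applied, reusing the paths from the example above; the resulting memory usage will depend on your setup:

```bash
# Lower-VRAM variant: q4_0 KV cache, runtime repack, attn tensors on CPU, no KV offload
./build/bin/llama-server \
    --alias ubergarm/gemma-3-27b-it-qat-mix-iq3_k.gguf \
    --model /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-iq4_ks.gguf \
    -ctk q4_0 -ctv q4_0 \
    -fa \
    -amb 512 \
    -fmoe \
    -c 32768 \
    -ub 512 \
    -ngl 99 \
    -rtr -ot attn=CPU -nkvo \
    --threads 4 \
    --host 127.0.0.1 \
    --port 8080
```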