# ik_llama.cpp imatrix Quantizations of google/gemma-3-27b-it-qat-q4_0-unquantized
This quant collection REQUIRES the ik_llama.cpp fork, which supports advanced non-linear SotA quants. Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

These quants provide best-in-class perplexity for the given memory footprint.
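If you haven't built the fork yet, here is a minimal sketch of the usual CMake flow for a CUDA build; the exact flags are an assumption here, so double-check the ik_llama.cpp README for your platform:

```bash
# Assumed build steps for a CUDA build of ik_llama.cpp; verify flags against the fork's README
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
```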
## Big Thanks
Shout out to Wendell and the Level1Techs crew, the community forums, and the YouTube channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community here and on r/LocalLLaMA for tips and tricks helping each other run all the fun new models!

Excited to share and learn together. Thanks!
## Quant Collection
So far these are my best recipes, offering the lowest perplexity per GiB. Check out the speed and quality comparison benchmarks, graphs, and discussion.
### ubergarm/gemma-3-27B-it-qat-iq4_ks.gguf

Best Quality

- 32k context in 23704MiB VRAM
- 16k context in 19488MiB VRAM
- 8k context in 17380MiB VRAM
- Only 13126MiB VRAM with `-rtr -ot attn=CPU -nkvo`
- Could go `q4_0` KV cache for even lower VRAM usage!

14.099 GiB (4.484 BPW)

- `f32`: 373 tensors
- `q4_0`: 62 tensors `blk.*.attn_v.weight`
- `q8_0`: 1 tensor
- `iq4_ks`: 372 tensors

Final estimate: PPL = 8.1755 +/- 0.06296
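To grab a single quant, a hedged sketch with huggingface-cli is below; the repo id and the exact filename inside the repo are assumptions, so adjust them to match the repository's file listing:

```bash
# Assumed repo id and filename; verify against the repository's file listing
huggingface-cli download ubergarm/gemma-3-27b-it-qat-GGUF \
  gemma-3-27b-it-qat-iq4_ks.gguf \
  --local-dir ./models/ubergarm/gemma-3-27b-it-qat-GGUF
```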
### ubergarm/gemma-3-27B-it-qat-mix-iq3_k.gguf

Smallest with Good Quality

- 32k context in 22306MiB VRAM
- 16k context in 18090MiB VRAM
- 8k context in 15982MiB VRAM
- Only 11960MiB VRAM with `-rtr -ot attn=CPU -nkvo`
- Could go `q4_0` KV cache for even lower VRAM usage!

12.733 GiB (4.050 BPW)

- `f32`: 373 tensors
- `q4_0`: 62 tensors `blk.*.attn_v.weight`
- `q8_0`: 1 tensor `token_embd.weight`
- `iq3_k`: 124 tensors `ffn_(gate|up).weight`
- `iq4_ks`: 248 tensors `ffn_down.weight`

Final estimate: PPL = 8.2367 +/- 0.06329
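The "Final estimate" lines above are in the format printed by the llama-perplexity tool. A minimal sketch for running your own measurement is below; the corpus and settings are assumptions (wikitext-2-raw's `wiki.test.raw` is the usual choice) and may differ from what produced the numbers above:

```bash
# Assumed corpus and settings; the reported numbers may have used different parameters
./build/bin/llama-perplexity \
  --model /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-iq4_ks.gguf \
  -f wiki.test.raw \
  -fa \
  -ngl 99 \
  --threads 4
```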
## Quick Start

ik_llama.cpp API server for GPU inferencing:
```bash
# This example for 24GB VRAM
./build/bin/llama-server \
    --alias ubergarm/gemma-3-27b-it-qat-mix-iq3_k.gguf \
    --model /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-iq4_ks.gguf \
    -ctk q8_0 -ctv q8_0 \
    -fa \
    -amb 512 \
    -fmoe \
    -c 32768 \
    -ub 512 \
    -ngl 99 \
    --threads 4 \
    --host 127.0.0.1 \
    --port 8080
```
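Once the server is up, you can smoke-test it over the OpenAI-compatible chat completions endpoint that llama-server exposes; the model name below just echoes the `--alias` above, and the payload itself is an illustrative assumption:

```bash
# Simple smoke test against the OpenAI-compatible endpoint
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ubergarm/gemma-3-27b-it-qat-mix-iq3_k.gguf",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```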
If you want more context and/or less VRAM usage, you can try the following (see the combined example below):

- Smaller KV cache quantization: `-ctk q4_0 -ctv q4_0`
- Runtime repack for CPU inferencing, override attn tensors to CPU, and disable KV offload: `-rtr -ot attn=CPU -nkvo`
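Here is a sketch of the same server command with both options applied, reusing the paths from the example above; the resulting memory usage will depend on your setup:

```bash
# Lower-VRAM variant: q4_0 KV cache, runtime repack, attn tensors on CPU, no KV offload
./build/bin/llama-server \
    --alias ubergarm/gemma-3-27b-it-qat-mix-iq3_k.gguf \
    --model /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-iq4_ks.gguf \
    -ctk q4_0 -ctv q4_0 \
    -fa \
    -amb 512 \
    -fmoe \
    -c 32768 \
    -ub 512 \
    -ngl 99 \
    -rtr -ot attn=CPU -nkvo \
    --threads 4 \
    --host 127.0.0.1 \
    --port 8080
```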