This is an HQQ-quantized version (4-bit weights, group-size 64) of the google/gemma-3-12b-it model.
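
If you want to produce a similar quantization yourself, transformers ships native HQQ support via `HqqConfig`. The snippet below is a minimal sketch applying the same settings (4-bit, group-size 64) to the base `google/gemma-3-12b-it` checkpoint; it is not necessarily the exact recipe used to build this repository.

```python
# Minimal sketch: on-the-fly HQQ quantization via transformers' HqqConfig.
# The base checkpoint and settings are assumptions, not the verified
# recipe behind this repo.
import torch
from transformers import Gemma3ForConditionalGeneration, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64)  # 4-bit weights, group size 64

model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-12b-it",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=quant_config,
)
```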

## Performance

| Benchmark            | bfp16 | HQQ 4-bit gs-64 | QAT 4-bit gs-32 |
|----------------------|------:|----------------:|----------------:|
| ARC (25-shot)        | 0.724 | 0.701           | 0.690           |
| HellaSwag (10-shot)  | 0.839 | 0.826           | 0.792           |
| MMLU (5-shot)        | 0.730 | 0.724           | 0.693           |
| TruthfulQA-MC2       | 0.580 | 0.585           | 0.550           |
| Winogrande (5-shot)  | 0.766 | 0.774           | 0.755           |
| GSM8K (5-shot)       | 0.874 | 0.862           | 0.808           |
| **Average**          | **0.752** | **0.745**   | **0.715**       |
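
The card does not state which evaluation harness produced these numbers. If you want to reproduce them, EleutherAI's `lm-evaluation-harness` covers all six benchmarks; the sketch below is an assumption about the setup (task names and few-shot counts taken from the table rows), not a documented recipe.

```python
# Hypothetical reproduction sketch with lm-evaluation-harness (pip install lm-eval).
# Task names and few-shot counts mirror the table rows; the actual harness and
# settings used for the reported numbers are not documented in this card.
import lm_eval

tasks = {
    "arc_challenge":  25,
    "hellaswag":      10,
    "mmlu":            5,
    "truthfulqa_mc2":  0,
    "winogrande":      5,
    "gsm8k":           5,
}

for task, shots in tasks.items():
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf,dtype=bfloat16",
        tasks=[task],
        num_fewshot=shots,
    )
    print(task, results["results"].get(task))
```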

## Usage

```python
# Requires transformers at commit 52cc204dd7fbd671452448028aae6262cea74dc2:
# pip install git+https://github.com/huggingface/transformers@52cc204dd7fbd671452448028aae6262cea74dc2

import torch

backend       = "gemlite"        # low-level 4-bit matmul backend used by HQQ
compute_dtype = torch.bfloat16
cache_dir     = None
model_id      = 'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf'

# Load the quantized model and its processor
from transformers import Gemma3ForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained(model_id, cache_dir=cache_dir)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    attn_implementation="sdpa",
    cache_dir=cache_dir,
    device_map="cuda",
)

# Patch the quantized linear layers to run on the chosen backend
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model.language_model, backend=backend, verbose=True)

# Inference
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=compute_dtype)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=128, do_sample=False)[0][input_len:]
    decoded    = processor.decode(generation, skip_special_tokens=True)

print(decoded)
```
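
The same pipeline handles text-only chats; below is a minimal variant reusing the `model` and `processor` loaded above (the prompt is just an example):

```python
# Text-only chat, reusing the quantized model/processor from the snippet above.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Give me three facts about bees."}]},
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)

print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```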
