This is an HQQ-quantized version (4-bit, group-size=64) of the gemma-3-12b-it model.
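
For reference, an equivalent 4-bit, group-size-64 HQQ quantization can be produced from the base model via the transformers HqqConfig integration. The snippet below is only a sketch under that assumption; the exact settings used to build this checkpoint (e.g. per-layer overrides) are not documented here.

#Sketch: HQQ 4-bit / group-size-64 quantization of the base model.
#Assumes the stock transformers HqqConfig path; not necessarily the exact
#recipe used to produce this repository.
import torch
from transformers import Gemma3ForConditionalGeneration, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64)

quantized_model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-12b-it",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=quant_config,
)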

Performance

| Models              | bfp16 | HQQ 4-bit gs-64 | QAT 4-bit gs-32 |
|---------------------|-------|-----------------|-----------------|
| ARC (25-shot)       | 0.724 | 0.701           | 0.690           |
| HellaSwag (10-shot) | 0.839 | 0.826           | 0.792           |
| MMLU (5-shot)       | 0.730 | 0.724           | 0.693           |
| TruthfulQA-MC2      | 0.580 | 0.585           | 0.550           |
| Winogrande (5-shot) | 0.766 | 0.774           | 0.755           |
| GSM8K (5-shot)      | 0.874 | 0.862           | 0.808           |
| Average             | 0.752 | 0.745           | 0.715           |
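
The scores above follow the usual few-shot setup of these benchmarks. If you want to re-run one of them against this checkpoint, something along the lines of the EleutherAI lm-evaluation-harness Python API should work; this is only a sketch under that assumption, and the harness version, batch size, and model-loading details behind the numbers above are not specified here (the multimodal checkpoint may need to be pointed at a compatible text-only model class).

#Sketch: re-running ARC (25-shot) with lm-evaluation-harness.
#Assumes `pip install lm-eval`; not the authors' exact evaluation setup.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf,dtype=bfloat16",
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size=8,
)
print(results["results"]["arc_challenge"])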

Usage

#Use transformers at commit 52cc204dd7fbd671452448028aae6262cea74dc2 (later versions may not be compatible):
#pip install git+https://github.com/huggingface/transformers@52cc204dd7fbd671452448028aae6262cea74dc2
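#The example below also relies on the hqq and gemlite packages. Assuming the
#standard PyPI distributions (versions are not pinned here, so treat this as a sketch):
#pip install hqq gemlite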

import torch
backend       = "gemlite" 
compute_dtype = torch.bfloat16 
cache_dir     = None
model_id      = 'mobiuslabsgmbh/gemma-3-12b-it_4bitgs64_bfp16_hqq_hf'

#Load model
from transformers import Gemma3ForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained(model_id, cache_dir=cache_dir)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    attn_implementation="sdpa",
    cache_dir=cache_dir,
    device_map="cuda",
)

#Optimize
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model.language_model, backend=backend, verbose=True)
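
#Optional: torch.compile can further speed up decoding on recent PyTorch builds.
#This is a generic PyTorch step, not something this model card requires.
#model.language_model = torch.compile(model.language_model, mode="max-autotune")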


############################################################################
#Inference
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=compute_dtype)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=128, do_sample=False)[0][input_len:]
    decoded    = processor.decode(generation, skip_special_tokens=True)

print(decoded)
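
The same model and processor also handle text-only chat. The snippet below adapts the example above by dropping the image entry; it is a minimal sketch that reuses the objects already created in the example.

#Text-only chat with the same model/processor.
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user",   "content": [{"type": "text", "text": "Explain HQQ quantization in two sentences."}]},
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)[0][inputs["input_ids"].shape[-1]:]

print(processor.decode(out, skip_special_tokens=True))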
