Quantization code

#3 opened by bullerwins

Hi!
Did you use the example code from the llm-compressor docs for this one? Which one?

Thanks a lot! I tried using the Llama example, but I didn't copy the pre-processor files and I think it doesn't support the vision model.

Owner

I'm using this code:

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_name = "google/gemma-3-27b-it"

# Load the processor and the full-precision model on CPU
processor = AutoProcessor.from_pretrained(model_name)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_name, device_map="cpu", torch_dtype="auto", trust_remote_code=True
)

# FP8 dynamic quantization of all Linear layers, except the LM head,
# the vision tower and the multi-modal projector
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=['re:.*lm_head', 're:vision_tower.*', 're:multi_modal_projector.*'],
)

SAVE_DIR = "gemma-3-27b-it-FP8-Dynamic"

# Apply the recipe and save the quantized model along with the processor files
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

How did you come to know that 're:vision_tower.*' and 're:multi_modal_projector.*' were necessary?

Owner

For most VLMs it is common practice to quantize only the LLM layers and keep the vision encoder and adapter/projector at FP16. I haven't compared this for Gemma 3, but it doesn't require a lot of additional memory, as the LLM is much bigger than the vision layers.
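As a rough illustration of what those ignore entries do, here is a minimal sketch. The module names below are hypothetical examples, and Python's re.match is used only to approximate how the 're:' patterns are resolved, not as the exact matching logic llm-compressor uses:

import re

# Regexes from the recipe's ignore list (without the 're:' prefix)
ignore_patterns = [r".*lm_head", r"vision_tower.*", r"multi_modal_projector.*"]

# Hypothetical module names, just to illustrate which ones get skipped
example_modules = [
    "language_model.model.layers.0.self_attn.q_proj",
    "lm_head",
    "vision_tower.vision_model.encoder.layers.0.mlp.fc1",
    "multi_modal_projector.mm_input_projection",
]

for name in example_modules:
    ignored = any(re.match(p, name) for p in ignore_patterns)
    print(f"{name}: {'ignored' if ignored else 'quantized to FP8'}")

Only the first (language model) module would be quantized; the LM head, vision tower and projector stay in their original precision.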

Owner

And if you want to find the corresponding vision-layer names for any model, you can look them up in its model.safetensors.index.json.
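For example, a minimal sketch (assuming huggingface_hub is installed, you have access to the repo, and it actually ships a model.safetensors.index.json) that lists the top-level module prefixes:

import json
from huggingface_hub import hf_hub_download

# Download only the weight index, not the weights themselves
index_path = hf_hub_download("google/gemma-3-27b-it", "model.safetensors.index.json")

with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

# Top-level prefixes, e.g. language_model / vision_tower / multi_modal_projector
print(sorted({name.split(".")[0] for name in weight_map}))

The printed prefixes are what you plug into the ignore patterns; the exact names depend on the checkpoint.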
