Quantization code

#3 opened by bullerwins

Hi!
Did you use the example code from the llm-compressor docs for this one? Which one?

Thanks a lot! I tried using the Llama example, but I didn't copy the pre-processor files and I think it doesn't support the vision model.

Owner

I'm using this code:

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_name = "google/gemma-3-27b-it"

# Load the processor and the full-precision model on CPU
processor = AutoProcessor.from_pretrained(model_name)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_name, device_map="cpu", torch_dtype="auto", trust_remote_code=True
)

# FP8 dynamic quantization of all Linear layers, except the LM head,
# the vision tower and the multi-modal projector
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=['re:.*lm_head', 're:vision_tower.*', 're:multi_modal_projector.*'],
)

SAVE_DIR = "gemma-3-27b-it-FP8-Dynamic"

# Apply the recipe and save the quantized model along with the processor files
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

How did you come to know that 're:vision_tower.*' and 're:multi_modal_projector.*' were necessary?

Owner

For most VLMs it is common practice to quantize only the LLM layers and keep the vision encoder and adapter/projector at FP16. I haven't compared this for Gemma 3, but it doesn't require a lot of additional memory, as the LLM is much bigger than the vision layers.
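As a rough illustration of what those ignore entries do, here is a minimal sketch. The module names below are hypothetical examples, and Python's re.match is used only to approximate how the 're:' patterns are resolved, not as the exact matching logic llm-compressor uses:

import re

# Regexes from the recipe's ignore list (without the 're:' prefix)
ignore_patterns = [r".*lm_head", r"vision_tower.*", r"multi_modal_projector.*"]

# Hypothetical module names, just to illustrate which ones get skipped
example_modules = [
    "language_model.model.layers.0.self_attn.q_proj",
    "lm_head",
    "vision_tower.vision_model.encoder.layers.0.mlp.fc1",
    "multi_modal_projector.mm_input_projection",
]

for name in example_modules:
    ignored = any(re.match(p, name) for p in ignore_patterns)
    print(f"{name}: {'ignored' if ignored else 'quantized to FP8'}")

Only the first (language model) module would be quantized; the LM head, vision tower and projector stay in their original precision.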

Owner

And if you want to find the corresponding vision-layer names for any model, you can look them up in its model.safetensors.index.json.
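For example, a minimal sketch (assuming huggingface_hub is installed, you have access to the repo, and it actually ships a model.safetensors.index.json) that lists the top-level module prefixes:

import json
from huggingface_hub import hf_hub_download

# Download only the weight index, not the weights themselves
index_path = hf_hub_download("google/gemma-3-27b-it", "model.safetensors.index.json")

with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

# Top-level prefixes, e.g. language_model / vision_tower / multi_modal_projector
print(sorted({name.split(".")[0] for name in weight_map}))

The printed prefixes are what you plug into the ignore patterns; the exact names depend on the checkpoint.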
