Quantization code #3
by bullerwins
Hi!
Did you use the example code from the llm-compressor docs for this one? Which one?
Thanks a lot! I tried using the Llama example, but I didn't copy the preprocessor files and I think it doesn't support the vision model.
I'm using this code:
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_name = "google/gemma-3-27b-it"

# Load the processor and the full multimodal model on CPU.
processor = AutoProcessor.from_pretrained(model_name)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_name, device_map="cpu", torch_dtype="auto", trust_remote_code=True
)

# FP8 dynamic quantization of the Linear layers, skipping the LM head
# and the vision encoder/projector.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=['re:.*lm_head', 're:vision_tower.*', 're:multi_modal_projector.*'],
)

SAVE_DIR = "gemma-3-27b-it-FP8-Dynamic"
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
How did you come to know that ignoring 're:vision_tower.*' and 're:multi_modal_projector.*' was necessary?
For most VLMs it is common practice to only quantize the LLM layers and keep the vision encoder and adapter/projector at FP16. I haven't compared this for Gemma 3 specifically, but it doesn't cost much additional memory, since the LLM is much larger than the vision layers.
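If you want to verify that size difference yourself, here is a minimal sketch. It assumes the model object loaded in the snippet above and that Gemma 3's top-level submodules are named language_model, vision_tower and multi_modal_projector (matching the ignore patterns):

from collections import Counter

# Count parameters per top-level submodule to see how small the
# vision parts are relative to the language model.
param_counts = Counter()
for name, param in model.named_parameters():
    top_level = name.split(".")[0]  # e.g. "language_model", "vision_tower"
    param_counts[top_level] += param.numel()

total = sum(param_counts.values())
for module, count in param_counts.most_common():
    print(f"{module}: {count / 1e9:.2f}B params ({100 * count / total:.1f}%)")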
And if you want to find the corresponding vision layer names for any model, you can look them up in its model.safetensors.index.json.
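For example, a small sketch that reads that index and prints the top-level module prefixes (the local path below is just a placeholder, adjust it to wherever the checkpoint is stored):

import json

with open("gemma-3-27b-it/model.safetensors.index.json") as f:
    index = json.load(f)

# weight_map maps each parameter name to the shard file it lives in,
# so its keys contain every layer name in the checkpoint.
prefixes = sorted({name.split(".")[0] for name in index["weight_map"]})
print(prefixes)  # e.g. ['language_model', 'multi_modal_projector', 'vision_tower']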