---
license: gemma
library_name: vllm
pipeline_tag: image-text-to-text
extra_gated_heading: Access Gemma on Hugging Face
extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and
  agree to Google’s usage license. To do this, please ensure you’re logged in to
  Hugging Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
base_model: google/gemma-3-27b-it
---

# FP8 Dynamic Quantized Gemma-3-27b-it

### Features
- Image-text-to-text
- Tool calling

## 1. What FP8‑Dynamic Quantization Is
* **FP8 format**
  * 8‑bit floating point (E4M3, as used here: 1 sign bit + 4 exponent bits + 3 mantissa bits).
  * Drastically shrinks weight/activation size while keeping floating‑point behavior.
* **Dynamic scheme (`FP8_DYNAMIC`)**
  * **Weights:** *static*, **per‑channel** quantization (each out‑feature channel has its own scale).
  * **Activations:** *dynamic*, **per‑token** quantization (scales are recomputed on the fly for every input token).
* **RTN (Round‑To‑Nearest) PTQ**
  * Post‑training; no back‑propagation required.
  * No calibration dataset needed (see the sketch after this list), because:
    * Weights use symmetric RTN.
    * Activations are quantized dynamically at inference time.
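
To make the scheme concrete, here is a minimal PyTorch sketch of the two quantization steps described above: static per‑channel scales for weights, dynamic per‑token scales for activations. This illustrates the idea only, not the llm-compressor internals; the shapes and the epsilon clamp are assumptions.

```
import torch

# Largest representable magnitude in FP8 E4M3 (448.0). Requires torch >= 2.1.
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_weight_per_channel(w: torch.Tensor):
    """Static quantization: one scale per output channel (row of w)."""
    scale = (w.abs().amax(dim=1, keepdim=True) / FP8_MAX).clamp(min=1e-12)
    return (w / scale).to(torch.float8_e4m3fn), scale  # scales ship with the checkpoint

def quantize_activation_per_token(x: torch.Tensor):
    """Dynamic quantization: one scale per token (row of x), computed on the fly."""
    scale = (x.abs().amax(dim=-1, keepdim=True) / FP8_MAX).clamp(min=1e-12)
    return (x / scale).to(torch.float8_e4m3fn), scale

# Equivalent of the dequantized matmul that a fused FP8 kernel would perform.
w_fp8, w_scale = quantize_weight_per_channel(torch.randn(4096, 4096))
x_fp8, x_scale = quantize_activation_per_token(torch.randn(8, 4096))
y = (x_fp8.float() * x_scale) @ (w_fp8.float() * w_scale).T
```
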
## 2. Serving the FP8 Model with vLLM

```
vllm serve BCCard/gemma-3-27b-it-FP8-Dynamic \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --enforce-eager \
  --api-key bccard \
  --served-model-name gemma-3-27b-it
```
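
Once the server is up, it exposes an OpenAI-compatible API. A quick smoke test, assuming vLLM's default port 8000 and the `--api-key` / `--served-model-name` values from the command above:

```
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="bccard")
response = client.chat.completions.create(
    model="gemma-3-27b-it",
    messages=[{"role": "user", "content": "Summarize FP8 dynamic quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```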

## 3. Quantization Code Walk‑Through (Shared Knowledge)

[LLM Compressor](https://github.com/vllm-project/llm-compressor) is an easy-to-use library for optimizing models for deployment with vLLM, including:

- A comprehensive set of quantization algorithms for weight-only and activation quantization
- Seamless integration with Hugging Face models and repositories
- A safetensors-based file format compatible with vLLM
- Large-model support via `accelerate`

```
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_name = "google/gemma-3-27b-it"

# Load the processor and the full-precision model.
processor = AutoProcessor.from_pretrained(model_name)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

# FP8_DYNAMIC: static per-channel FP8 weights, dynamic per-token FP8 activations.
# The LM head and the vision stack are excluded from quantization.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

# One-shot PTQ; no calibration data is required for this scheme.
SAVE_DIR = "gemma-3-27b-it-FP8-Dynamic"
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```
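
After quantization, the saved directory can be loaded directly by vLLM. A short offline sanity check (a sketch; the prompt and sampling settings are arbitrary):

```
from vllm import LLM, SamplingParams

# Load the freshly quantized checkpoint from the directory saved above.
llm = LLM(model="gemma-3-27b-it-FP8-Dynamic", max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is FP8 quantization?"], params)
print(outputs[0].outputs[0].text)
```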

## 4. Gemma 3 model card

**Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core)

**Terms of Use**: [Terms](https://ai.google.dev/gemma/terms)

**Authors**: Google DeepMind, BC Card (quantization)

### Description

Gemma is a family of lightweight, state-of-the-art open models from Google,
built from the same research and technology used to create the Gemini models.
Gemma 3 models are multimodal, handling text and image input and generating text
output, with open weights for both pre-trained variants and instruction-tuned
variants. Gemma 3 has a large, 128K context window, multilingual support in over
140 languages, and is available in more sizes than previous versions. Gemma 3
models are well-suited for a variety of text generation and image understanding
tasks, including question answering, summarization, and reasoning. Their
relatively small size makes it possible to deploy them in environments with
limited resources such as laptops, desktops, or your own cloud infrastructure,
democratizing access to state-of-the-art AI models and helping foster innovation
for everyone.

### Inputs and outputs

- **Input:**
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
  - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and
    32K tokens for the 1B size

- **Output:**
  - Generated text in response to the input, such as an answer to a question,
    analysis of image content, or a summary of a document
  - Total output context of 8192 tokens

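For image input, a request can embed an `image_url` content part, following the OpenAI multimodal chat format that vLLM supports. A sketch against the server from section 2 (the image URL is a placeholder); vLLM performs the resize and encoding described above internally:

```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="bccard")
response = client.chat.completions.create(
    model="gemma-3-27b-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            # Placeholder URL; any reachable image works.
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```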

### Citation

```none
@article{gemma_2025,
  title={Gemma 3 FP8 Dynamic},
  url={https://bccard.ai},
  author={BC Card},
  year={2025}
}
```