---
license: gemma
library_name: vllm
pipeline_tag: image-text-to-text
extra_gated_heading: Access Gemma on Hugging Face
extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and
  agree to Google’s usage license. To do this, please ensure you’re logged in to
  Hugging Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
base_model: google/gemma-3-27b-it
---

# FP8 Dynamic Quantized Gemma-3-27b-it

### Features
- Image-text-to-text
- Tool calling

## 1. What FP8‑Dynamic Quantization Is
* **FP8 format**
  * 8‑bit floating point (E4M3, as used here: 1 sign bit + 4 exponent bits + 3 mantissa bits).
  * Drastically shrinks weight/activation size while keeping floating‑point behavior.
* **Dynamic scheme (`FP8_DYNAMIC`)**
  * **Weights:** *static*, **per‑channel** quantization (each out‑feature channel has its own scale).
  * **Activations:** *dynamic*, **per‑token** quantization (scales are recomputed on the fly for every input token).
* **RTN (Round‑To‑Nearest) PTQ**
  * Post‑training; no back‑propagation required.
  * No calibration dataset needed (see the sketch after this list), because:
    * Weights use symmetric RTN.
    * Activations are quantized dynamically at inference time.
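
To make the scheme concrete, here is a minimal PyTorch sketch of the two quantization steps described above: static per‑channel scales for weights, dynamic per‑token scales for activations. This illustrates the idea only, not the llm-compressor internals; the shapes and the epsilon clamp are assumptions.

```
import torch

# Largest representable magnitude in FP8 E4M3 (448.0). Requires torch >= 2.1.
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_weight_per_channel(w: torch.Tensor):
    """Static quantization: one scale per output channel (row of w)."""
    scale = (w.abs().amax(dim=1, keepdim=True) / FP8_MAX).clamp(min=1e-12)
    return (w / scale).to(torch.float8_e4m3fn), scale  # scales ship with the checkpoint

def quantize_activation_per_token(x: torch.Tensor):
    """Dynamic quantization: one scale per token (row of x), computed on the fly."""
    scale = (x.abs().amax(dim=-1, keepdim=True) / FP8_MAX).clamp(min=1e-12)
    return (x / scale).to(torch.float8_e4m3fn), scale

# Equivalent of the dequantized matmul that a fused FP8 kernel would perform.
w_fp8, w_scale = quantize_weight_per_channel(torch.randn(4096, 4096))
x_fp8, x_scale = quantize_activation_per_token(torch.randn(8, 4096))
y = (x_fp8.float() * x_scale) @ (w_fp8.float() * w_scale).T
```
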
## 2. Serving the FP8 Model with vLLM

```
vllm serve BCCard/gemma-3-27b-it-FP8-Dynamic \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --enforce-eager \
  --api-key bccard \
  --served-model-name gemma-3-27b-it
```
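
Once the server is up, it exposes an OpenAI-compatible API. A quick smoke test, assuming vLLM's default port 8000 and the `--api-key` / `--served-model-name` values from the command above:

```
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="bccard")
response = client.chat.completions.create(
    model="gemma-3-27b-it",
    messages=[{"role": "user", "content": "Summarize FP8 dynamic quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```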

## 3. Quantization Code Walk‑Through (Shared Knowledge)

[LLM Compressor](https://github.com/vllm-project/llm-compressor) is an easy-to-use library for optimizing models for deployment with vLLM, including:

- A comprehensive set of quantization algorithms for weight-only and activation quantization
- Seamless integration with Hugging Face models and repositories
- A safetensors-based file format compatible with vLLM
- Large-model support via `accelerate`

```
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_name = "google/gemma-3-27b-it"

# Load the processor and the full-precision model.
processor = AutoProcessor.from_pretrained(model_name)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

# FP8_DYNAMIC: static per-channel FP8 weights, dynamic per-token FP8 activations.
# The LM head and the vision stack are excluded from quantization.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

# One-shot PTQ; no calibration data is required for this scheme.
SAVE_DIR = "gemma-3-27b-it-FP8-Dynamic"
oneshot(model=model, recipe=recipe, output_dir=SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```
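
After quantization, the saved directory can be loaded directly by vLLM. A short offline sanity check (a sketch; the prompt and sampling settings are arbitrary):

```
from vllm import LLM, SamplingParams

# Load the freshly quantized checkpoint from the directory saved above.
llm = LLM(model="gemma-3-27b-it-FP8-Dynamic", max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is FP8 quantization?"], params)
print(outputs[0].outputs[0].text)
```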

## 4. Gemma 3 model card

**Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core)

**Terms of Use**: [Terms](https://ai.google.dev/gemma/terms)

**Authors**: Google DeepMind, BC Card (quantization)

### Description

Gemma is a family of lightweight, state-of-the-art open models from Google,
built from the same research and technology used to create the Gemini models.
Gemma 3 models are multimodal, handling text and image input and generating text
output, with open weights for both pre-trained variants and instruction-tuned
variants. Gemma 3 has a large, 128K context window, multilingual support in over
140 languages, and is available in more sizes than previous versions. Gemma 3
models are well-suited for a variety of text generation and image understanding
tasks, including question answering, summarization, and reasoning. Their
relatively small size makes it possible to deploy them in environments with
limited resources such as laptops, desktops, or your own cloud infrastructure,
democratizing access to state-of-the-art AI models and helping foster innovation
for everyone.

### Inputs and outputs

- **Input:**
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
  - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and
    32K tokens for the 1B size

- **Output:**
  - Generated text in response to the input, such as an answer to a question,
    analysis of image content, or a summary of a document
  - Total output context of 8192 tokens

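For image input, a request can embed an `image_url` content part, following the OpenAI multimodal chat format that vLLM supports. A sketch against the server from section 2 (the image URL is a placeholder); vLLM performs the resize and encoding described above internally:

```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="bccard")
response = client.chat.completions.create(
    model="gemma-3-27b-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            # Placeholder URL; any reachable image works.
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```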

### Citation

```none
@article{gemma_2025,
  title={Gemma 3 FP8 Dynamic},
  url={https://bccard.ai},
  author={BC Card},
  year={2025}
}
```