sh2orc committed
Commit a623237 · verified · 1 parent: d2d4730

Update README.md

Files changed (1): README.md (+91 -3)
README.md CHANGED
---
license: gemma
library_name: vllm
pipeline_tag: image-text-to-text
extra_gated_heading: Access Gemma on Hugging Face
extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and
  agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging
  Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
base_model: google/gemma-3-12b-it
---

# FP8 Dynamic Quantized Gemma-3-12b-it

### Features
- Image-text-to-text
- Tool calling (see the example request below)

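As a quick illustration of tool calling, here is a hedged sketch of a request against the vLLM server configured in section 2 below. It assumes the server is additionally launched with vLLM's tool options (`--enable-auto-tool-choice` plus a `--tool-call-parser` suitable for Gemma), which the serve command in this card does not include, and `get_weather` is a hypothetical function:

```python
# pip install openai -- targets the OpenAI-compatible vLLM server from section 2.
from openai import OpenAI

# Endpoint is a local-deployment assumption; the key matches --api-key below.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="bccard")

# Hypothetical tool schema the model may choose to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gemma-3-12b-it",  # matches --served-model-name in section 2
    messages=[{"role": "user", "content": "What's the weather in Seoul right now?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```
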
## 1. What FP8‑Dynamic Quantization Is

* **FP8 format**
  * 8‑bit floating‑point; this scheme uses the E4M3 layout (1 sign bit + 4 exponent bits + 3 mantissa bits).
  * Drastically shrinks weight/activation size while keeping floating‑point behavior.
* **Dynamic scheme (`FP8_DYNAMIC`)**
  * **Weights:** *static*, **per‑channel** quantization (each out‑feature channel has its own scale).
  * **Activations:** *dynamic*, **per‑token** quantization (scales are recomputed on‑the‑fly for every input token).
* **RTN (Round‑To‑Nearest) PTQ**
  * Post‑training; no back‑propagation required (a one‑shot sketch follows this list).
  * No calibration dataset is needed because:
    * weights use symmetric RTN, and
    * activations are quantized dynamically at inference time.

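Since the scheme needs no calibration data, a checkpoint like this can be produced in a single one-shot pass. Below is a minimal sketch with the `llm-compressor` library, not the exact recipe behind this repository: the ignore list for the vision stack and the output directory are assumptions, and recent `llmcompressor` versions export `oneshot` at the top level.

```python
# pip install llmcompressor -- data-free FP8_DYNAMIC one-shot quantization sketch.
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "google/gemma-3-12b-it"

model = Gemma3ForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8_DYNAMIC = static per-channel E4M3 weights + dynamic per-token activations;
# no calibration dataset is passed anywhere.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    # Assumed ignore list: keep the LM head and vision stack unquantized.
    ignore=["lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = "gemma-3-12b-it-FP8-Dynamic"  # hypothetical output path
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```
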
## 2. Serving the FP8 Model with vLLM

```bash
vllm serve BCCard/gemma-3-12b-it-FP8-Dynamic \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --enforce-eager \
  --api-key bccard \
  --served-model-name gemma-3-12b-it
```

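The server exposes an OpenAI-compatible API, so a standard OpenAI client can query it. A minimal sketch of an image-text-to-text request (the endpoint and image URL are placeholder assumptions; the key and model name match the flags above):

```python
# pip install openai -- the vLLM server above speaks the OpenAI chat API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="bccard")  # local-deployment assumption

response = client.chat.completions.create(
    model="gemma-3-12b-it",  # matches --served-model-name above
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder image
            {"type": "text", "text": "What is in this image?"},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
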
## 3. Gemma 3 model card

**Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core)

**Terms of Use**: [Terms](https://ai.google.dev/gemma/terms)

**Authors**: Google DeepMind; BC Card (quantization)

### Description

Gemma is a family of lightweight, state-of-the-art open models from Google,
built from the same research and technology used to create the Gemini models.
Gemma 3 models are multimodal, handling text and image input and generating text
output, with open weights for both pre-trained variants and instruction-tuned
variants. Gemma 3 has a large, 128K context window, multilingual support in over
140 languages, and is available in more sizes than previous versions. Gemma 3
models are well-suited for a variety of text generation and image understanding
tasks, including question answering, summarization, and reasoning. Their
relatively small size makes it possible to deploy them in environments with
limited resources such as laptops, desktops or your own cloud infrastructure,
democratizing access to state-of-the-art AI models and helping foster innovation
for everyone.

### Inputs and outputs

- **Input:**
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
  - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and
    32K tokens for the 1B size

- **Output:**
  - Generated text in response to the input, such as an answer to a
    question, analysis of image content, or a summary of a document
  - Total output context of 8192 tokens

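For a quick local check of this input/output path without a server, here is a sketch using the Transformers multimodal API on the base checkpoint (the image URL is a placeholder and the generation settings are arbitrary):

```python
# pip install transformers accelerate -- requires a transformers release with Gemma 3 support.
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-12b-it"  # base checkpoint; the FP8 repo targets vLLM serving

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
).eval()
processor = AutoProcessor.from_pretrained(model_id)

# One image (resized to 896 x 896 and encoded to 256 tokens internally) plus text.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/photo.jpg"},  # placeholder URL
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
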
### Citation

```none
@article{gemma_2025,
    title={Gemma 3 FP8 Dynamic},
    url={https://bccard.ai},
    author={BC Card},
    year={2025}
}
```