nm-research committed · Commit 4962ff6 (verified) · 1 Parent(s): c537037

Update README.md

---
tags:
- vllm
- vision
- w4a16
license: gemma
base_model: google/gemma-3-4b-it
library_name: transformers
---

# gemma-3-4b-it-quantized.w4a16

## Model Overview
- **Model Architecture:** Gemma3ForConditionalGeneration
  - **Input:** Vision-Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
  - **Activation quantization:** FP16
- **Release Date:** 6/4/2025
- **Version:** 1.0
- **Model Developers:** RedHatAI

Quantized version of [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it).

### Model Optimizations

This model was obtained by quantizing the weights of [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) to the INT4 data type, ready for inference with vLLM >= 0.8.0.
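
As a rough, back-of-the-envelope illustration (the parameter count and group size below are assumptions for illustration, not measured values for this checkpoint), storing 4-bit weights plus one FP16 scale per group needs roughly a quarter of the memory of FP16 weights for the quantized layers; modules excluded from quantization (lm_head, embeddings, vision tower, see the creation recipe below) stay in higher precision.

```python
# Back-of-the-envelope estimate of weight memory for the quantized Linear layers.
# params and group_size are illustrative assumptions, not checkpoint measurements.
params = 3.5e9      # assumed number of weights covered by W4A16
group_size = 128    # assumed GPTQ group size
bits_fp16 = 16
bits_w4a16 = 4 + bits_fp16 / group_size  # 4-bit weights + one FP16 scale per group


def to_gb(n_params, bits):
    return n_params * bits / 8 / 1e9


print(f"FP16 weights : {to_gb(params, bits_fp16):.2f} GB")   # ~7.00 GB
print(f"W4A16 weights: {to_gb(params, bits_w4a16):.2f} GB")  # ~1.80 GB
```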

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm.assets.image import ImageAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="nm-testing/gemma-3-4b-it-quantized.w4a16",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
)

# prepare inputs (Gemma 3 chat format with an image placeholder)
question = "What is the content of this image?"
inputs = {
    "prompt": f"<start_of_turn>user\n<start_of_image>{question}<end_of_turn>\n<start_of_turn>model\n",
    "multi_modal_data": {
        "image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
    },
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
print(f"PROMPT  : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
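
As a minimal sketch of that workflow (the port and image URL below are placeholders, not part of this model card), the model can be served with `vllm serve` and queried through the standard `openai` client:

```python
# Launch the server first, e.g.:
#   vllm serve nm-testing/gemma-3-4b-it-quantized.w4a16 --max-model-len 4096
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nm-testing/gemma-3-4b-it-quantized.w4a16",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the content of this image?"},
            # Placeholder image URL, for illustration only
            {"type": "image_url", "image_url": {"url": "https://example.com/cherry_blossom.jpg"}},
        ],
    }],
    temperature=0.2,
    max_tokens=64,
)
print(response.choices[0].message.content)
```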

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below:

<details>
<summary>Model Creation Code</summary>

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Load model.
model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Oneshot arguments
DATASET_ID = "neuralmagic/calibration"
DATASET_SPLIT = {"LLM": "train[:512]"}
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42)

dampening_frac = 0.05


def data_collator(batch):
    assert len(batch) == 1, "Only batch size of 1 is supported for calibration"
    item = batch[0]
    collated = {}

    for key, value in item.items():
        if isinstance(value, torch.Tensor):
            # Add a batch dimension to tensor inputs
            collated[key] = value.unsqueeze(0)
        elif isinstance(value, list) and isinstance(value[0][0], int):
            # Handle tokenized inputs like input_ids, attention_mask
            collated[key] = torch.tensor(value)
        elif isinstance(value, list) and isinstance(value[0][0], float):
            # Handle possible float sequences
            collated[key] = torch.tensor(value)
        elif isinstance(value, list) and isinstance(value[0][0], torch.Tensor):
            # Handle batched image data (e.g., pixel_values as [C, H, W])
            collated[key] = torch.stack(value)  # -> [1, C, H, W]
        else:
            print(f"[WARN] Unrecognized type in collator for key={key}, type={type(value)}")

    return collated


# Recipe
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W4A16",
        ignore=["re:.*lm_head.*", "re:.*embed_tokens.*", "re:vision_tower.*", "re:multi_modal_projector.*"],
        sequential_update=True,
        dampening_frac=dampening_frac,
    )
]

SAVE_DIR = f"{model_id.split('/')[1]}-quantized.w4a16"

# Perform oneshot
oneshot(
    model=model,
    tokenizer=model_id,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    data_collator=data_collator,
    output_dir=SAVE_DIR,
)
```
</details>
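
As an optional sanity check after the one-shot run (not part of the original recipe), the quantization metadata written to `SAVE_DIR` can be inspected; the sketch below only assumes that llm-compressor wrote a `quantization_config` entry into the saved `config.json`.

```python
import json
from pathlib import Path

# Same output directory as SAVE_DIR in the creation script above.
save_dir = Path("gemma-3-4b-it-quantized.w4a16")

config = json.loads((save_dir / "config.json").read_text())
qcfg = config.get("quantization_config", {})

# The quantization method and the ignored modules (lm_head, embeddings,
# vision tower) should line up with the GPTQModifier recipe above.
print(qcfg.get("quant_method"))
print(qcfg.get("ignore"))
```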

## Evaluation

The model was evaluated using [lm_evaluation_harness](https://github.com/neuralmagic/lm-evaluation-harness) on the OpenLLM v1 text benchmark. The evaluation was conducted using the following command:

<details>
<summary>Evaluation Commands</summary>

### OpenLLM v1
```
lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=<n>,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True,enforce_eager=True \
  --tasks openllm \
  --batch_size auto
```
</details>

### Accuracy

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>google/gemma-3-4b-it</th>
      <th>nm-testing/gemma-3-4b-it-quantized.w4a16</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="7"><b>OpenLLM V1</b></td>
      <td>ARC Challenge</td>
      <td>56.57%</td>
      <td>56.57%</td>
      <td>100.00%</td>
    </tr>
    <tr>
      <td>GSM8K</td>
      <td>76.12%</td>
      <td>72.33%</td>
      <td>95.02%</td>
    </tr>
    <tr>
      <td>HellaSwag</td>
      <td>74.96%</td>
      <td>73.35%</td>
      <td>97.86%</td>
    </tr>
    <tr>
      <td>MMLU</td>
      <td>58.38%</td>
      <td>56.33%</td>
      <td>96.49%</td>
    </tr>
    <tr>
      <td>TruthfulQA (mc2)</td>
      <td>51.87%</td>
      <td>50.81%</td>
      <td>97.96%</td>
    </tr>
    <tr>
      <td>Winogrande</td>
      <td>70.32%</td>
      <td>68.82%</td>
      <td>97.87%</td>
    </tr>
    <tr>
      <td><b>Average Score</b></td>
      <td><b>64.70%</b></td>
      <td><b>63.04%</b></td>
      <td><b>97.42%</b></td>
    </tr>
  </tbody>
</table>
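
The Recovery column reports the quantized score as a percentage of the baseline score (values reproduce the table to rounding). A minimal check against the GSM8K row above:

```python
# Recovery (%) = 100 * quantized_score / baseline_score
baseline = 76.12   # google/gemma-3-4b-it, GSM8K
quantized = 72.33  # gemma-3-4b-it-quantized.w4a16, GSM8K

recovery = 100 * quantized / baseline
print(f"{recovery:.2f}%")  # 95.02%, matching the table
```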