SpiridonSunRotator committed (verified)
Commit b2c7f39 · Parent(s): 994b222

Added evaluation metrics and usage example

Files changed (1): README.md ADDED (+103, -0)

---
license: gemma
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- int4
- vllm
- llmcompressor
base_model: google/gemma-3-12b-it
---

# gemma-3-12b-it-GPTQ-4b-128g

## Model Overview

This model was obtained by quantizing the weights of [gemma-3-12b-it](https://huggingface.co/google/gemma-3-12b-it) to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 75%.
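
As a rough sanity check on the ~75% figure, here is a back-of-envelope estimate of the weight-only memory footprint (assuming roughly 12B parameters; activations, the KV cache, per-group scales, and the unquantized vision tower are ignored):

```python
# Weight-only, back-of-envelope estimate; the real checkpoint also stores
# per-group scales and keeps the vision tower in 16-bit precision.
num_params = 12e9                            # approximate parameter count
bf16_gib = num_params * 16 / 8 / 1024**3     # ~22.4 GiB at 16 bits/param
int4_gib = num_params * 4 / 8 / 1024**3      # ~5.6 GiB at 4 bits/param
print(f"bf16 ~{bf16_gib:.1f} GiB, int4 ~{int4_gib:.1f} GiB, "
      f"saving ~{1 - int4_gib / bf16_gib:.0%}")
```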

Only the weights of the linear operators inside the `language_model` transformer blocks are quantized. The vision model and the multimodal projector are kept in their original precision. Weights are quantized with a symmetric per-group scheme (group size 128), using the GPTQ algorithm.
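
For intuition, the sketch below shows plain round-to-nearest symmetric per-group INT4 quantization with group size 128. It only illustrates the weight format; it does not reproduce GPTQ itself, which additionally uses approximate second-order information to compensate for rounding error. The function name and tensor shapes are made up for the example.

```python
import torch

def quantize_symmetric_per_group(w: torch.Tensor, group_size: int = 128, num_bits: int = 4):
    # Illustration only: round-to-nearest symmetric per-group quantization,
    # not the GPTQ error-compensated procedure used for this checkpoint.
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    q_max = 2 ** (num_bits - 1) - 1                                   # 7 for INT4
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / q_max
    q = torch.clamp(torch.round(groups / scales), -q_max - 1, q_max)
    w_dq = (q * scales).reshape(out_features, in_features)            # dequantized weights
    return q.to(torch.int8), scales, w_dq

w = torch.randn(256, 512)                 # toy linear-layer weight
q, scales, w_dq = quantize_symmetric_per_group(w)
print((w - w_dq).abs().mean())            # mean round-trip error
```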

The model checkpoint is saved in the [compressed_tensors](https://github.com/neuralmagic/compressed-tensors) format.
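
To confirm how the quantization is serialized, one option is to inspect the quantization metadata stored in the checkpoint's config. A minimal sketch (the exact field layout is defined by compressed-tensors and is not reproduced here):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ISTA-DASLab/gemma-3-12b-it-GPTQ-4b-128g")
# The compressed-tensors metadata lives under `quantization_config` in config.json;
# printing it shows the stored quantization scheme.
print(getattr(config, "quantization_config", None))
```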

## Evaluation

This model was evaluated on the OpenLLM v1 benchmarks. Model outputs were generated with the `vLLM` engine.

| Model | ArcC | GSM8k | Hellaswag | MMLU | TruthfulQA-mc2 | Winogrande | Average | Recovery |
|----------------------------|:------:|:------:|:---------:|:------:|:--------------:|:----------:|:-------:|:--------:|
| gemma-3-12b-it             | 0.7125 | 0.8719 | 0.8377    | 0.7230 | 0.5798         | 0.7893     | 0.7524  | 1.0000   |
| gemma-3-12b-it-INT4 (this) | 0.6988 | 0.8643 | 0.8254    | 0.7078 | 0.5638         | 0.7830     | 0.7405  | 0.9842   |
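
The Recovery column is the ratio of the quantized model's average score to the baseline average; recomputing it from the rows above reproduces the reported value:

```python
# Per-task scores copied from the table above.
baseline = [0.7125, 0.8719, 0.8377, 0.7230, 0.5798, 0.7893]
int4     = [0.6988, 0.8643, 0.8254, 0.7078, 0.5638, 0.7830]

baseline_avg = sum(baseline) / len(baseline)        # ~0.7524
int4_avg = sum(int4) / len(int4)                    # ~0.7405
print(f"recovery = {int4_avg / baseline_avg:.4f}")  # ~0.9842
```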

## Reproduction

The results were obtained using the following commands:

```bash
MODEL=ISTA-DASLab/gemma-3-12b-it-GPTQ-4b-128g
MODEL_ARGS="pretrained=$MODEL,max_model_len=4096,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.80"

lm_eval \
  --model vllm \
  --model_args $MODEL_ARGS \
  --tasks openllm \
  --batch_size auto
```

## Usage

* To use the model in `transformers`, update the package to the stable Gemma 3 release:

  `pip install git+https://github.com/huggingface/[email protected]`
* To use the model in `vLLM`, update the package to a version that includes this [PR](https://github.com/vllm-project/vllm/pull/14660/files). A short vLLM sketch follows the `transformers` example below.

An example of inference via `transformers` is provided below:

```python
# pip install accelerate

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch

model_id = "ISTA-DASLab/gemma-3-12b-it-GPTQ-4b-128g"

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

# **Overall Impression:** The image is a close-up shot of a vibrant garden scene,
# focusing on a cluster of pink cosmos flowers and a busy bumblebee.
# It has a slightly soft, natural feel, likely captured in daylight.
```
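
For `vLLM`, a minimal text-only sketch of offline inference is shown below (assuming a vLLM build that already includes Gemma 3 support, as noted above; the prompt is just an example):

```python
from vllm import LLM, SamplingParams

# Text-only sketch; vLLM also supports multimodal inputs for this model.
llm = LLM(model="ISTA-DASLab/gemma-3-12b-it-GPTQ-4b-128g", max_model_len=4096)
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain INT4 weight quantization in two sentences."},
]
outputs = llm.chat(conversation, sampling_params)
print(outputs[0].outputs[0].text)
```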