lmajnaric committed · verified · Commit c07938e · Parent: ea0c9f9

Update README.md

Files changed (1): README.md (+130, -7)
README.md CHANGED
@@ -7,6 +7,8 @@ tags:
 model-index:
 - name: paligemma-architecture
   results: []
+language:
+- en
 ---
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -14,8 +16,8 @@ should probably proofread and complete it, then remove this comment. -->
 
 # paligemma-architecture
 
-This model is a fine-tuned version of [google/paligemma2-3b-pt-448](https://huggingface.co/google/paligemma2-3b-pt-448) on a custom architecture dataset.
-
+This model is a fine-tuned version of [google/paligemma2-3b-pt-448](https://huggingface.co/google/paligemma2-3b-pt-448) on a custom architecture dataset (700 image-description pairs).
+This is my first model uploaded to Hugging Face.
 
 ## Training procedure
 
@@ -35,16 +37,137 @@ The following hyperparameters were used during training:
 - lr_scheduler_warmup_steps: 2
 - num_epochs: 4
 
+Training used approximately 30 GB of GPU RAM on a Google Colab A100.
+
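+The exact training script is not part of this card. As a rough illustration only, a
+Trainer setup consistent with the hyperparameters above might look like the sketch
+below; the dataset loading, prompt prefix, batch size, and learning rate are
+assumptions, not the values actually used.
+
+```python
+import torch
+from transformers import (
+    AutoProcessor,
+    PaliGemmaForConditionalGeneration,
+    Trainer,
+    TrainingArguments,
+)
+
+base_id = "google/paligemma2-3b-pt-448"
+model = PaliGemmaForConditionalGeneration.from_pretrained(base_id, torch_dtype=torch.bfloat16)
+processor = AutoProcessor.from_pretrained(base_id)
+
+train_dataset = ...  # placeholder: the 700 image-description pairs
+
+def collate_fn(examples):
+    # Assumed dataset schema: each example holds a PIL image and a description string
+    images = [ex["image"] for ex in examples]
+    prompts = ["Describe this building's architectural style."] * len(examples)  # illustrative prefix
+    labels = [ex["description"] for ex in examples]
+    # The processor's `suffix` argument turns the descriptions into training labels
+    return processor(text=prompts, images=images, suffix=labels,
+                     return_tensors="pt", padding="longest")
+
+args = TrainingArguments(
+    output_dir="paligemma-architecture",
+    num_train_epochs=4,             # from the card
+    warmup_steps=2,                 # from the card
+    per_device_train_batch_size=2,  # assumption
+    learning_rate=2e-5,             # assumption
+    bf16=True,
+    remove_unused_columns=False,    # keep the raw image/description columns for the collator
+)
+
+trainer = Trainer(
+    model=model,
+    args=args,
+    train_dataset=train_dataset,
+    data_collator=collate_fn,
+)
+trainer.train()
+```
+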
 ### Training results
 
-TrainOutput(global_step=352, training_loss=7.797419488430023,
-metrics={'train_runtime': 1653.6164, 'train_samples_per_second': 1.705,
-'train_steps_per_second': 0.213, 'total_flos': 5.772661476596784e+16,
-'train_loss': 7.797419488430023, 'epoch': 3.9645390070921986})
+TrainOutput(global_step=352,
+            training_loss=7.797419488430023,
+            metrics={'train_runtime': 1653.6164,
+                     'train_samples_per_second': 1.705,
+                     'train_steps_per_second': 0.213,
+                     'total_flos': 5.772661476596784e+16,
+                     'train_loss': 7.797419488430023,
+                     'epoch': 3.9645390070921986})
+
+## Usage
+
+Using a CUDA-supported GPU:
+
+```python
+from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
+import torch
+from PIL import Image
+import requests
+
+# Model, device, and precision
+model_id = "lmajnaric/paligemma448_arch_finetune"
+device = "cuda"
+dtype = torch.bfloat16
+
+# Load the image from a URL or a local path
+url = "https://cms.guggenheim-bilbao.eus/uploads/2019/05/el-edificio-guggenheim-bilbao-1.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+# image = Image.open("building.jpg")
+
+# Load model and processor with bfloat16 precision
+model = PaliGemmaForConditionalGeneration.from_pretrained(
+    model_id,
+    torch_dtype=dtype,
+    device_map=device,
+).eval()
+
+processor = AutoProcessor.from_pretrained(model_id)
+
+# Create prompt
+prompt = (
+    "Describe this building's architectural style in detail. What are its key features? "
+    "What period and region is this style associated with? What materials are predominantly "
+    "used in this building? Describe any notable decorative elements, patterns, or ornaments. "
+    "Describe the overall structure, including the shape, height, and any distinctive "
+    "architectural elements like towers, domes, or facades. If the building has a name, "
+    "please state it in the beginning."
+)
+
+# Process inputs
+model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
+input_len = model_inputs["input_ids"].shape[-1]
+
+# Generate text
+with torch.inference_mode():
+    generation = model.generate(
+        **model_inputs,
+        max_new_tokens=256,
+        do_sample=True,   # Enable sampling for more diverse outputs
+        temperature=0.7,  # Control randomness (lower = more deterministic)
+        top_p=0.9,
+    )
+
+# Only decode the new tokens (not the prompt)
+generation = generation[0][input_len:]
+decoded = processor.decode(generation, skip_special_tokens=True)
+
+print(decoded)
+```
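+
+If GPU memory is tight, the checkpoint can likely also be loaded in 4-bit via
+bitsandbytes. This variant is not part of the original card and is untested here; a
+minimal sketch, assuming `bitsandbytes` is installed:
+
+```python
+from transformers import AutoProcessor, BitsAndBytesConfig, PaliGemmaForConditionalGeneration
+import torch
+
+model_id = "lmajnaric/paligemma448_arch_finetune"
+
+# 4-bit quantization to reduce GPU memory; drop this if bitsandbytes is unavailable
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_compute_dtype=torch.bfloat16,
+)
+
+model = PaliGemmaForConditionalGeneration.from_pretrained(
+    model_id,
+    quantization_config=bnb_config,
+    device_map="auto",
+).eval()
+processor = AutoProcessor.from_pretrained(model_id)
+# Generation then works exactly as in the snippet above.
+```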
+
+Or on CPU:
+
+```python
+from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
+import torch
+from PIL import Image
+import requests
+
+# Model
+model_id = "lmajnaric/paligemma448_arch_finetune"
+
+# Load the image from a URL or a local path
+url = "https://cms.guggenheim-bilbao.eus/uploads/2019/05/el-edificio-guggenheim-bilbao-1.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+# image = Image.open("building.jpg")
+
+# Load model and processor (default float32 precision on CPU)
+model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
+processor = AutoProcessor.from_pretrained(model_id)
+
+# Create prompt
+prompt = (
+    "Describe this building's architectural style in detail. What are its key features? "
+    "What period and region is this style associated with? What materials are predominantly "
+    "used in this building? Describe any notable decorative elements, patterns, or ornaments. "
+    "Describe the overall structure, including the shape, height, and any distinctive "
+    "architectural elements like towers, domes, or facades. If the building has a name, "
+    "please state it in the beginning."
+)
+
+# Process inputs
+model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
+input_len = model_inputs["input_ids"].shape[-1]
+
+# Generate text
+with torch.inference_mode():
+    generation = model.generate(
+        **model_inputs,
+        max_new_tokens=256,
+        do_sample=True,   # Enable sampling for more diverse outputs
+        temperature=0.7,  # Control randomness (lower = more deterministic)
+        top_p=0.9,
+    )
+
+# Only decode the new tokens (not the prompt)
+generation = generation[0][input_len:]
+decoded = processor.decode(generation, skip_special_tokens=True)
+
+print(decoded)
+```
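+
+To caption several local photos in one go, the same pipeline can be wrapped in a
+simple loop. A minimal sketch that continues the snippet above (it reuses `model`,
+`processor`, and `prompt`; the `photos/` folder is illustrative):
+
+```python
+from pathlib import Path
+
+for path in sorted(Path("photos").glob("*.jpg")):
+    image = Image.open(path).convert("RGB")
+    model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
+    input_len = model_inputs["input_ids"].shape[-1]
+    with torch.inference_mode():
+        generation = model.generate(**model_inputs, max_new_tokens=256)
+    decoded = processor.decode(generation[0][input_len:], skip_special_tokens=True)
+    print(f"{path.name}: {decoded}")
+```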
 
 ### Framework versions
 
 - Transformers 4.50.0.dev0
 - Pytorch 2.6.0+cu124
 - Datasets 3.4.0
-- Tokenizers 0.21.0
+- Tokenizers 0.21.0