lmajnaric/paligemma448_arch_finetune

@@ -7,6 +7,8 @@ tags:
 model-index:
 - name: paligemma-architecture
   results: []
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -14,8 +16,8 @@ should probably proofread and complete it, then remove this comment. -->
 # paligemma-architecture
-This model is a fine-tuned version of [google/paligemma2-3b-pt-448](https://huggingface.co/google/paligemma2-3b-pt-448) on a custom architecture dataset.
 ## Training procedure
@@ -35,16 +37,137 @@ The following hyperparameters were used during training:
 - lr_scheduler_warmup_steps: 2
 - num_epochs: 4
 ### Training results
-TrainOutput(global_step=352, training_loss=7.797419488430023,
-metrics={'train_runtime': 1653.6164, 'train_samples_per_second': 1.705,
-'train_steps_per_second': 0.213, 'total_flos': 5.772661476596784e+16,
-'train_loss': 7.797419488430023, 'epoch': 3.9645390070921986})
 ### Framework versions
 - Transformers 4.50.0.dev0
 - Pytorch 2.6.0+cu124
 - Datasets 3.4.0
-- Tokenizers 0.21.0

 model-index:
 - name: paligemma-architecture
   results: []
+language:
+- en
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 # paligemma-architecture
+This model is a fine-tuned version of [google/paligemma2-3b-pt-448](https://huggingface.co/google/paligemma2-3b-pt-448) on a custom architecture dataset (700 image-description pairs).
+This is my first model uploaded to Hugging Face.
 ## Training procedure
 - lr_scheduler_warmup_steps: 2
 - num_epochs: 4
+Training used approximately 30 GB of GPU RAM on a Google Colab A100.
 ### Training results
+TrainOutput(global_step=352,
+training_loss=7.797419488430023,
+metrics={
+'train_runtime': 1653.6164,
+'train_samples_per_second': 1.705,
+'train_steps_per_second': 0.213,
+'total_flos': 5.772661476596784e+16,
+'train_loss': 7.797419488430023,
+'epoch': 3.9645390070921986})
+## Usage
+Using a CUDA-capable GPU:
+```python
+from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
+import torch
+from PIL import Image
+import requests
+# Model and device
+model_id = "lmajnaric/paligemma448_arch_finetune"
+device = "cuda"
+dtype = torch.bfloat16
+# Load image using path or url
+url = "https://cms.guggenheim-bilbao.eus/uploads/2019/05/el-edificio-guggenheim-bilbao-1.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+# image = Image.open("building.jpg")
+# Load model and processor with bfloat16 precision
+model = PaliGemmaForConditionalGeneration.from_pretrained(
+    model_id,
+    torch_dtype=dtype,
+    device_map=device,
+).eval()
+processor = PaliGemmaProcessor.from_pretrained(model_id)
+# Create prompt
+prompt = (
+        "Describe this building's architectural style in detail. What are its key features? "
+        "What period and region is this style associated with? What materials are predominantly "
+        "used in this building? Describe any notable decorative elements, patterns, or ornaments. "
+        "Describe the overall structure, including the shape, height, and any distinctive "
+        "architectural elements like towers, domes, or facades. If the building has a name, "
+        "please state it in the beginning."
+    )
+# Process inputs
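+# Cast floating-point inputs (pixel values) to the model's dtype, then move them to its device
+model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.dtype).to(model.device)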
+input_len = model_inputs["input_ids"].shape[-1]
+# Generate text
+with torch.inference_mode():
+    generation = model.generate(
+        **model_inputs,
+        max_new_tokens=256,
+        do_sample=True,      # Enable sampling for more diverse outputs
+        temperature=0.7,     # Control randomness (lower = more deterministic)
+        top_p=0.9,
+    )
+    # Only decode the new tokens (not the prompt)
+    generation = generation[0][input_len:]
+    decoded = processor.decode(generation, skip_special_tokens=True)
+    print(decoded)
+```
+Or on CPU:
+```python
+from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
+import torch
+from PIL import Image
+import requests
+# Model and device
+model_id = "lmajnaric/paligemma448_arch_finetune"
+# Load image using path or url
+url = "https://cms.guggenheim-bilbao.eus/uploads/2019/05/el-edificio-guggenheim-bilbao-1.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+# image = Image.open("building.jpg")
+# Load model and processor (default precision, on CPU)
+model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
+processor = PaliGemmaProcessor.from_pretrained(model_id)
+# Create prompt
+prompt = (
+        "Describe this building's architectural style in detail. What are its key features? "
+        "What period and region is this style associated with? What materials are predominantly "
+        "used in this building? Describe any notable decorative elements, patterns, or ornaments. "
+        "Describe the overall structure, including the shape, height, and any distinctive "
+        "architectural elements like towers, domes, or facades. If the building has a name, "
+        "please state it in the beginning."
+    )
+# Process inputs
+model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
+input_len = model_inputs["input_ids"].shape[-1]
+# Generate text
+with torch.inference_mode():
+    generation = model.generate(
+        **model_inputs,
+        max_new_tokens=256,
+        do_sample=True,      # Enable sampling for more diverse outputs
+        temperature=0.7,     # Control randomness (lower = more deterministic)
+        top_p=0.9,
+    )
+    # Only decode the new tokens (not the prompt)
+    generation = generation[0][input_len:]
+    decoded = processor.decode(generation, skip_special_tokens=True)
+    print(decoded)
+```
 ### Framework versions
 - Transformers 4.50.0.dev0
 - Pytorch 2.6.0+cu124
 - Datasets 3.4.0
+- Tokenizers 0.21.0
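
As a rough sanity check on the training results reported above, the logged throughput figures can be cross-checked against the stated dataset size. The sketch below uses only numbers copied from the `TrainOutput`. The variable names are illustrative, and the derived values are estimates: the logged rates are rounded, and "samples per step" here means samples per optimizer step (which would include any gradient accumulation), not necessarily the per-device batch size.

```python
# Figures copied from the TrainOutput in the model card.
global_step = 352
train_runtime = 1653.6164            # seconds
samples_per_second = 1.705
steps_per_second = 0.213
epochs_completed = 3.9645390070921986

# Derived estimates (approximate, because the logged rates are rounded).
samples_seen = samples_per_second * train_runtime
samples_per_step = samples_per_second / steps_per_second
samples_per_epoch = samples_seen / epochs_completed
steps_from_rate = steps_per_second * train_runtime

print(f"samples seen in total:  ~{samples_seen:.0f}")
print(f"samples per optimizer step: ~{samples_per_step:.1f}")
print(f"samples per epoch:      ~{samples_per_epoch:.0f}")
print(f"steps implied by rate:  ~{steps_from_rate:.0f} (logged: {global_step})")
```

The per-epoch estimate lands close to the roughly 700 image-description pairs mentioned in the description, and the step count implied by the rate agrees with the logged `global_step`, so the reported numbers are at least internally consistent.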