---
library_name: transformers
license: gemma
base_model: google/paligemma2-3b-pt-448
tags:
- generated_from_trainer
model-index:
- name: paligemma-architecture
  results: []
language:
- en
---


# paligemma-architecture

This model is a fine-tuned version of [google/paligemma2-3b-pt-448](https://huggingface.co/google/paligemma2-3b-pt-448), trained on a custom architecture dataset of 700 image-description pairs.
This is my first model uploaded to Hugging Face.

## Training procedure

Training followed the [PaliGemma fine-tuning notebook from smol-vision](https://github.com/merveenoyan/smol-vision/blob/main/Fine_tune_PaliGemma.ipynb), with the dataset loading and some parameters adjusted.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 8
- optimizer: OptimizerNames.ADAMW_HF with betas=(0.9,0.999) and epsilon=1e-08, no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 2
- num_epochs: 4

Training used approximately 30 GB of GPU memory on an A100 in Google Colab.
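
As a rough sketch, the hyperparameters listed above could be written as keyword arguments for `transformers.TrainingArguments`. The dictionary name `training_kwargs`, the single-GPU assumption, and the mapping of `train_batch_size` to `per_device_train_batch_size` are assumptions made here for illustration. They are not taken from the original training script, and the `"adamw_hf"` optimizer name may not be available in every Transformers version.

```python
# Hypothetical mapping of the listed hyperparameters onto
# transformers.TrainingArguments keyword names (assumes a single GPU).
training_kwargs = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "gradient_accumulation_steps": 8,
    "optim": "adamw_hf",            # string value of OptimizerNames.ADAMW_HF
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_steps": 2,
    "num_train_epochs": 4,
}

num_gpus = 1  # assumption: one A100, as on a standard Colab runtime

# Effective (total) batch size = per-device batch * accumulation steps * devices
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
    * num_gpus
)
print(effective_batch_size)  # 8, matching total_train_batch_size above
```

These values could then be passed as `TrainingArguments(output_dir="out", **training_kwargs)`, where `"out"` is a placeholder path.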

### Training results

Values from the `TrainOutput` returned by the trainer:
- global_step: 352
- training_loss: 7.797419488430023
- train_runtime: 1653.6164
- train_samples_per_second: 1.705
- train_steps_per_second: 0.213
- total_flos: 5.772661476596784e+16
- train_loss: 7.797419488430023
- epoch: 3.9645390070921986
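
The reported throughput numbers can be cross-checked against each other with a little arithmetic. The sketch below uses only the values reported above and assumes `train_runtime` is in seconds; the variable names are illustrative.

```python
# Values copied from the reported TrainOutput
global_step = 352
train_runtime_s = 1653.6164        # assumed to be seconds
train_samples_per_second = 1.705
num_epochs = 4                     # from the hyperparameters

# Optimizer steps per second should match train_steps_per_second
steps_per_second = global_step / train_runtime_s

# Total samples processed, and the implied number of samples per epoch
samples_processed = train_samples_per_second * train_runtime_s
samples_per_epoch = samples_processed / num_epochs

print(f"{steps_per_second:.3f}")   # 0.213
print(round(samples_per_epoch))    # 705
```

The implied figure of roughly 705 samples per epoch is consistent with the dataset size of about 700 pairs stated earlier.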

## Usage

On a CUDA-capable GPU:

```python
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
import torch
from PIL import Image
import requests

# Model and device
model_id = "lmajnaric/paligemma448_arch_finetune"
device = "cuda"

# Load image using path or url
url = "https://cms.guggenheim-bilbao.eus/uploads/2019/05/el-edificio-guggenheim-bilbao-1.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# image = Image.open("building.jpg")


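# Load model and processor with bfloat16 precision
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # was an undefined `dtype` variable
    device_map=device,
).eval()

# Use the imported PaliGemmaProcessor (AutoProcessor was never imported)
processor = PaliGemmaProcessor.from_pretrained(model_id)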


# Create prompt
prompt = (
    "Describe this building's architectural style in detail. What are its key features? "
    "What period and region is this style associated with? What materials are predominantly "
    "used in this building? Describe any notable decorative elements, patterns, or ornaments. "
    "Describe the overall structure, including the shape, height, and any distinctive "
    "architectural elements like towers, domes, or facades. If the building has a name, "
    "please state it in the beginning."
)

# Process inputs
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

# Generate text
with torch.inference_mode():
    generation = model.generate(
        **model_inputs, 
        max_new_tokens=256,
        do_sample=True,      # Enable sampling for more diverse outputs
        temperature=0.7,     # Control randomness (lower = more deterministic)
        top_p=0.9,
    )
    
    # Only decode the new tokens (not the prompt)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    
    print(decoded)
```

On CPU:

```python
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
import torch
from PIL import Image
import requests

# Model and device
model_id = "lmajnaric/paligemma448_arch_finetune"

# Load image using path or url
url = "https://cms.guggenheim-bilbao.eus/uploads/2019/05/el-edificio-guggenheim-bilbao-1.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# image = Image.open("building.jpg")


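# Load model and processor (no dtype is passed, so the default precision is used)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
# Use the imported PaliGemmaProcessor (AutoProcessor was never imported)
processor = PaliGemmaProcessor.from_pretrained(model_id)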


# Create prompt
prompt = (
    "Describe this building's architectural style in detail. What are its key features? "
    "What period and region is this style associated with? What materials are predominantly "
    "used in this building? Describe any notable decorative elements, patterns, or ornaments. "
    "Describe the overall structure, including the shape, height, and any distinctive "
    "architectural elements like towers, domes, or facades. If the building has a name, "
    "please state it in the beginning."
)

# Process inputs
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

# Generate text
with torch.inference_mode():
    generation = model.generate(
        **model_inputs, 
        max_new_tokens=256,
        do_sample=True,      # Enable sampling for more diverse outputs
        temperature=0.7,     # Control randomness (lower = more deterministic)
        top_p=0.9,
    )
    
    # Only decode the new tokens (not the prompt)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    
    print(decoded)
```
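
The two snippets above differ only in device and precision, so one way to merge them is to choose both at runtime. This is a minimal sketch: the helper name `pick_device_and_dtype` is hypothetical, and using float32 on CPU is an assumption about a sensible default, not something from the original training setup.

```python
import torch


def pick_device_and_dtype():
    """Return ("cuda", bfloat16) when a GPU is visible, else ("cpu", float32)."""
    if torch.cuda.is_available():
        return "cuda", torch.bfloat16
    # Assumption: float32 is the safer default on CPU
    return "cpu", torch.float32


device, dtype = pick_device_and_dtype()
print(device, dtype)
```

The returned values could then be passed as `device_map=device` and `torch_dtype=dtype` when calling `PaliGemmaForConditionalGeneration.from_pretrained`.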

### Framework versions

- Transformers 4.50.0.dev0
- Pytorch 2.6.0+cu124
- Datasets 3.4.0
- Tokenizers 0.21.0