---
library_name: transformers
license: gemma
base_model: google/paligemma2-3b-pt-448
tags:
- generated_from_trainer
model-index:
- name: paligemma-architecture
  results: []
language:
- en
---

# paligemma-architecture

This model is a fine-tuned version of [google/paligemma2-3b-pt-448](https://huggingface.co/google/paligemma2-3b-pt-448) on a custom architecture dataset (700 image description pairs). This is my first model uploaded to HuggingFace.

## Training procedure

Followed the [notebook from smol-vision](https://github.com/merveenoyan/smol-vision/blob/main/Fine_tune_PaliGemma.ipynb), adjusted dataset loading and some parameters.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 8
- optimizer: Use OptimizerNames.ADAMW_HF with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 2
- num_epochs: 4

Approx.
30GB of GPU RAM, trained on Google Colab's A100.

### Training results

TrainOutput(global_step=352, training_loss=7.797419488430023, metrics={'train_runtime': 1653.6164, 'train_samples_per_second': 1.705, 'train_steps_per_second': 0.213, 'total_flos': 5.772661476596784e+16, 'train_loss': 7.797419488430023, 'epoch': 3.9645390070921986})

## Usage

Using a CUDA-capable GPU:

```python
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
import torch
from PIL import Image
import requests

# Model, device, and precision
model_id = "lmajnaric/paligemma448_arch_finetune"
device = "cuda"
dtype = torch.bfloat16

# Load image using path or url
url = "https://cms.guggenheim-bilbao.eus/uploads/2019/05/el-edificio-guggenheim-bilbao-1.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# image = Image.open("building.jpg")

# Load model and processor with bfloat16 precision
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
).eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Create prompt
prompt = (
    "Describe this building's architectural style in detail. What are its key features? "
    "What period and region is this style associated with? What materials are predominantly "
    "used in this building? Describe any notable decorative elements, patterns, or ornaments. "
    "Describe the overall structure, including the shape, height, and any distinctive "
    "architectural elements like towers, domes, or facades. If the building has a name, "
    "please state it in the beginning."
)

# Process inputs (cast floating-point tensors to the model dtype, then move to the GPU)
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(dtype).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

# Generate text
with torch.inference_mode():
    generation = model.generate(
        **model_inputs,
        max_new_tokens=256,
        do_sample=True,   # Enable sampling for more diverse outputs
        temperature=0.7,  # Control randomness (lower = more deterministic)
        top_p=0.9,
    )

# Only decode the new tokens (not the prompt)
generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
```

Or on CPU:

```python
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
import torch
from PIL import Image
import requests

# Model
model_id = "lmajnaric/paligemma448_arch_finetune"

# Load image using path or url
url = "https://cms.guggenheim-bilbao.eus/uploads/2019/05/el-edificio-guggenheim-bilbao-1.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# image = Image.open("building.jpg")

# Load model and processor (default precision, on CPU)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Create prompt
prompt = (
    "Describe this building's architectural style in detail. What are its key features? "
    "What period and region is this style associated with? What materials are predominantly "
    "used in this building? Describe any notable decorative elements, patterns, or ornaments. "
    "Describe the overall structure, including the shape, height, and any distinctive "
    "architectural elements like towers, domes, or facades. If the building has a name, "
    "please state it in the beginning."
)

# Process inputs
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

# Generate text
with torch.inference_mode():
    generation = model.generate(
        **model_inputs,
        max_new_tokens=256,
        do_sample=True,   # Enable sampling for more diverse outputs
        temperature=0.7,  # Control randomness (lower = more deterministic)
        top_p=0.9,
    )

# Only decode the new tokens (not the prompt)
generation = generation[0][input_len:]
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
```

### Framework versions

- Transformers 4.50.0.dev0
- Pytorch 2.6.0+cu124
- Datasets 3.4.0
- Tokenizers 0.21.0
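
As a quick sanity check on the training numbers reported earlier in this card, the sketch below recomputes the effective batch size and the step throughput from the listed hyperparameters and the `TrainOutput` values. It assumes a single GPU (the card mentions one Colab A100 but does not state the device count explicitly); the variable names are only illustrative.

```python
# Values copied from the hyperparameter list and the TrainOutput in this card.
train_batch_size = 1
gradient_accumulation_steps = 8
global_step = 352
train_runtime_s = 1653.6164

# Assumption: one device, so no extra multiplier for data parallelism.
effective_batch_size = train_batch_size * gradient_accumulation_steps

# Optimizer steps per second, comparable to 'train_steps_per_second'.
steps_per_second = global_step / train_runtime_s

# Samples consumed by the completed optimizer steps.
samples_seen = global_step * effective_batch_size

print(effective_batch_size)         # 8
print(round(steps_per_second, 3))   # 0.213
print(samples_seen)                 # 2816
```

The first two values agree with the reported `total_train_batch_size` of 8 and `train_steps_per_second` of 0.213.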