# vit-swin-base-224-gpt2-image-captioning
This model is a VisionEncoderDecoder model fine-tuned on 60% of the COCO2014 dataset. It achieves the following results on the testing set:
- Loss: 0.7989
- Rouge1: 53.1153
- Rouge2: 24.2307
- Rougel: 51.5002
- Rougelsum: 51.4983
- Bleu: 17.7765
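As a point of reference, the sketch below shows how such metrics can be computed with the evaluate library. The predictions and references are placeholders, and the exact evaluation configuration behind the reported numbers is not documented in this card.

```python
import evaluate

# placeholder captions; replace with real generated captions and ground-truth references
predictions = ["two cows laying in a field"]
references = ["two cows lying down in a grassy field"]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# ROUGE takes one reference string per prediction
rouge_scores = rouge.compute(predictions=predictions, references=references)
# BLEU takes a list of reference strings per prediction
bleu_scores = bleu.compute(predictions=predictions, references=[[r] for r in references])

# scores are in [0, 1]; the numbers reported above follow the usual x100 convention
print({k: round(v * 100, 4) for k, v in rouge_scores.items()})
print("bleu:", round(bleu_scores["bleu"] * 100, 4))
```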
## Model description
The model was initialized with microsoft/swin-base-patch4-window7-224-in22k as the vision encoder and gpt2 as the text decoder.
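As a rough sketch of how such a pairing is assembled before fine-tuning (illustrative only, not the exact training script), the transformers library can tie the two pretrained checkpoints together with cross-attention:

```python
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor

# tie the Swin encoder and GPT-2 decoder together; the decoder gets randomly
# initialized cross-attention layers, which is why fine-tuning is needed
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/swin-base-patch4-window7-224-in22k", "gpt2"
)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# mirrors the card's choice of ViTImageProcessor for preprocessing
image_processor = ViTImageProcessor.from_pretrained("microsoft/swin-base-patch4-window7-224-in22k")

# GPT-2 has no pad token by default; a common choice is to reuse the EOS token
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```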
## Intended uses & limitations
You can use this model for image captioning only.
### How to use
You can either use the simple pipeline API:
```python
from transformers import pipeline

image_captioner = pipeline("image-to-text", model="Abdou/vit-swin-base-224-gpt2-image-captioning")
# infer the caption
caption = image_captioner("http://images.cocodataset.org/test-stuff2017/000000000019.jpg")[0]['generated_text']
print(f"caption: {caption}")
```
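The same pipeline call also accepts a local file instead of a URL. The path below is just a placeholder, reusing the image_captioner defined above:

```python
# hypothetical local file, shown for illustration only
caption = image_captioner("path/to/your_image.jpg")[0]['generated_text']
print(f"caption: {caption}")
```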
Or initialize everything for more flexibility:
```python
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor
import torch
import os
import urllib.parse as parse
from PIL import Image
import requests

# a function to determine whether a string is a URL or not
def is_url(string):
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except:
        return False

# a function to load an image from a URL or a local path
def load_image(image_path):
    if is_url(image_path):
        return Image.open(requests.get(image_path, stream=True).raw)
    elif os.path.exists(image_path):
        return Image.open(image_path)

# a function to perform inference
def get_caption(model, image_processor, tokenizer, image_path):
    image = load_image(image_path)
    # preprocess the image
    img = image_processor(image, return_tensors="pt").to(device)
    # generate the caption (using greedy decoding by default)
    output = model.generate(**img)
    # decode the output
    caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return caption

device = "cuda" if torch.cuda.is_available() else "cpu"
# load the fine-tuned image captioning model and corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning").to(device)
tokenizer = GPT2TokenizerFast.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")
image_processor = ViTImageProcessor.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")

# target image
url = "http://images.cocodataset.org/test-stuff2017/000000000019.jpg"
# get the caption
caption = get_caption(model, image_processor, tokenizer, url)
print(f"caption: {caption}")
```
Output:
Two cows laying in a field with a sky background.
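As the comment in the snippet notes, generate uses greedy decoding by default. A possible variant is beam search; the parameter values below are illustrative and not the settings used for the reported scores, reusing model, tokenizer, and img from the example above:

```python
# illustrative beam-search decoding; values are not the card's official settings
output = model.generate(**img, num_beams=4, max_length=32, early_stopping=True)
caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
print(f"caption: {caption}")
```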
## Training procedure
You can check this guide to learn how this model was fine-tuned.
### Training hyperparameters
The following hyperparameters were used during training (a sketch mapping them onto training arguments follows the list):
- learning_rate: 5e-05
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2
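As a minimal sketch, assuming a Seq2SeqTrainer-style setup (the dataset, collator, and metric wiring are omitted and not taken from this card), these hyperparameters map onto Seq2SeqTrainingArguments roughly as follows:

```python
from transformers import Seq2SeqTrainingArguments

# illustrative mapping of the listed hyperparameters; "Adam with betas=(0.9, 0.999)
# and epsilon=1e-08" corresponds to the transformers default AdamW settings
training_args = Seq2SeqTrainingArguments(
    output_dir="vit-swin-base-224-gpt2-image-captioning",  # placeholder output directory
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=2,
    predict_with_generate=True,  # needed to compute ROUGE/BLEU during evaluation
)
```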
### Training results

| Training Loss | Epoch | Step | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Bleu | Gen Len |
|---|---|---|---|---|---|---|---|---|---|
| 1.0018 | 0.38 | 2000 | 0.8860 | 38.6537 | 13.8145 | 35.3932 | 35.3935 | 8.2448 | 11.2946 |
| 0.8827 | 0.75 | 4000 | 0.8395 | 40.0458 | 14.8829 | 36.5321 | 36.5366 | 9.1169 | 11.2946 |
| 0.8378 | 1.13 | 6000 | 0.8140 | 41.2736 | 15.9576 | 37.5504 | 37.5512 | 9.871 | 11.2946 |
| 0.7913 | 1.51 | 8000 | 0.8012 | 41.6642 | 16.1987 | 37.8786 | 37.8891 | 10.0786 | 11.2946 |
| 0.7794 | 1.89 | 10000 | 0.7933 | 41.9119 | 16.3738 | 38.1062 | 38.1292 | 10.288 | 11.2946 |
Total training time: ~5 hours on an NVIDIA A100 GPU.
### Framework versions
- Transformers 4.26.0
- Pytorch 1.13.1+cu116
- Datasets 2.9.0
- Tokenizers 0.13.2