---
tags:
- generated_from_trainer
datasets:
- coco
metrics:
- rouge
- bleu
model-index:
- name: vit-swin-base-224-gpt2-image-captioning
  results: []
license: mit
language:
- en
pipeline_tag: image-to-text
---

# vit-swin-base-224-gpt2-image-captioning

This is a [VisionEncoderDecoder](https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder) model fine-tuned on 60% of the [COCO2014](https://huggingface.co/datasets/HuggingFaceM4/COCO) dataset.
It achieves the following results on the testing set:
- Loss: 0.7989
- Rouge1: 53.1153
- Rouge2: 24.2307
- Rougel: 51.5002
- Rougelsum: 51.4983
- Bleu: 17.7765
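
If you want to compute the same kind of scores for your own outputs, both metrics are available through the `evaluate` library. The snippet below is a minimal sketch with made-up captions; note that the scores above appear to be scaled to 0–100, while `evaluate` returns values in [0, 1]:

```python
import evaluate

# hypothetical captions, just to illustrate the metric calls
predictions = ["Two cows laying in a field with a sky background."]
references = ["Two cows lying in a grassy field under a cloudy sky."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# ROUGE takes one reference string per prediction; BLEU takes a list of references per prediction
print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
```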

## Model description

The model uses [microsoft/swin-base-patch4-window7-224-in22k](https://huggingface.co/microsoft/swin-base-patch4-window7-224-in22k) as the vision encoder and [gpt2](https://huggingface.co/gpt2) as the text decoder.
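
For context, this is roughly how such an encoder-decoder pair is assembled before fine-tuning. It is a sketch of the general recipe, not necessarily the exact script used for this checkpoint:

```python
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor

# pair the Swin encoder with the GPT-2 decoder; the decoder's cross-attention
# layers are freshly initialized and learned during fine-tuning
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/swin-base-patch4-window7-224-in22k", "gpt2"
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
image_processor = ViTImageProcessor.from_pretrained("microsoft/swin-base-patch4-window7-224-in22k")

# GPT-2 has no pad token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```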

## Intended uses & limitations

You can use this model for image captioning only.

## How to use

You can either use the simple pipeline API:

```python
from transformers import pipeline

image_captioner = pipeline("image-to-text", model="Abdou/vit-swin-base-224-gpt2-image-captioning")
# infer the caption
caption = image_captioner("http://images.cocodataset.org/test-stuff2017/000000000019.jpg")[0]['generated_text']
print(f"caption: {caption}")

```

Or initialize everything for more flexibility:

```python
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor
from PIL import Image
import requests
import torch

# load an image from a local path or a URL
def load_image(image_path):
    if image_path.startswith("http"):
        return Image.open(requests.get(image_path, stream=True).raw).convert("RGB")
    return Image.open(image_path).convert("RGB")

# a function to perform inference
def get_caption(model, image_processor, tokenizer, image_path):
    image = load_image(image_path)
    # preprocess the image
    img = image_processor(image, return_tensors="pt").to(device)
    # generate the caption (using greedy decoding by default)
    output = model.generate(**img)
    # decode the output
    caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return caption

device = "cuda" if torch.cuda.is_available() else "cpu"
# load the fine-tuned image captioning model and corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning").to(device)
tokenizer = GPT2TokenizerFast.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")
image_processor = ViTImageProcessor.from_pretrained("Abdou/vit-swin-base-224-gpt2-image-captioning")

# target image
url = "http://images.cocodataset.org/test-stuff2017/000000000019.jpg"
# get the caption
caption = get_caption(model, image_processor, tokenizer, url)
print(f"caption: {caption}")

```
Output:
```
Two cows laying in a field with a sky background.
```
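
The snippet above decodes greedily. `model.generate` accepts the usual generation arguments, so you can swap in beam search instead. The sketch below reuses `load_image`, `model`, `image_processor`, `tokenizer`, `device` and `url` from the snippet above; the `num_beams` and `max_length` values are illustrative, not the model's defaults:

```python
# same as get_caption, but with beam search instead of greedy decoding
def get_caption_beam(model, image_processor, tokenizer, image_path,
                     num_beams=3, max_length=32):
    image = load_image(image_path)
    pixel_values = image_processor(image, return_tensors="pt").to(device)
    output = model.generate(**pixel_values, num_beams=num_beams,
                            max_length=max_length, early_stopping=True)
    return tokenizer.batch_decode(output, skip_special_tokens=True)[0]

print(get_caption_beam(model, image_processor, tokenizer, url))
```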

## Training procedure

You can check [this guide](https://www.thepythoncode.com/article/image-captioning-with-pytorch-and-transformers-in-python) to learn how this model was fine-tuned.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2
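
For illustration, these settings correspond roughly to a `Seq2SeqTrainingArguments` configuration like the one below. It is a sketch, not a copy of the original training script: `output_dir`, the evaluation cadence, and `predict_with_generate` are assumptions based on the results table.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="vit-swin-base-224-gpt2-image-captioning",  # assumed
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=2,
    # Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the Trainer default optimizer
    evaluation_strategy="steps",
    eval_steps=2000,              # matches the validation rows in the table below
    predict_with_generate=True,   # needed to compute ROUGE/BLEU during evaluation
)
```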

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Rouge1  | Rouge2  | Rougel  | Rougelsum | Bleu    | Gen Len |
|:-------------:|:-----:|:-----:|:---------------:|:-------:|:-------:|:-------:|:---------:|:-------:|:-------:|
| 1.0018        | 0.38  | 2000  | 0.8860          | 38.6537 | 13.8145 | 35.3932 | 35.3935   | 8.2448  | 11.2946 |
| 0.8827        | 0.75  | 4000  | 0.8395          | 40.0458 | 14.8829 | 36.5321 | 36.5366   | 9.1169  | 11.2946 |
| 0.8378        | 1.13  | 6000  | 0.8140          | 41.2736 | 15.9576 | 37.5504 | 37.5512   | 9.871   | 11.2946 |
| 0.7913        | 1.51  | 8000  | 0.8012          | 41.6642 | 16.1987 | 37.8786 | 37.8891   | 10.0786 | 11.2946 |
| 0.7794        | 1.89  | 10000 | 0.7933          | 41.9119 | 16.3738 | 38.1062 | 38.1292   | 10.288  | 11.2946 |

Total training time: ~5 hours on an NVIDIA A100 GPU.

### Framework versions

- Transformers 4.26.0
- Pytorch 1.13.1+cu116
- Datasets 2.9.0
- Tokenizers 0.13.2