---
library_name: transformers
datasets:
- laicsiifes/flickr30k-pt-br
language:
- pt
metrics:
- bleu
- rouge
- meteor
- bertscore
base_model:
- pierreguillou/gpt2-small-portuguese
pipeline_tag: image-to-text
model-index:
- name: Swin-GPorTuguese-2
  results:
  - task:
      name: Image Captioning
      type: image-to-text
    dataset:
      name: Flickr30K
      type: laicsiifes/flickr30k-pt-br
      split: test
    metrics:
    - name: CIDEr-D
      type: cider
      value: 64.71
    - name: BLEU@4
      type: bleu
      value: 23.15
    - name: ROUGE-L
      type: rouge
      value: 39.39
    - name: METEOR
      type: meteor
      value: 44.36
    - name: BERTScore
      type: bertscore
      value: 71.70
---

# 🎉 Swin-GPorTuguese-2 for Brazilian Portuguese Image Captioning

Swin-GPorTuguese-2 is an image captioning model trained on [Flickr30K Portuguese](https://huggingface.co/datasets/laicsiifes/flickr30k-pt-br) (a version of Flickr30K translated with the Google Translator API)
at an image resolution of 224x224 and a maximum sequence length of 1024 tokens.


## 🤖 Model Description

Swin-GPorTuguese-2 is a Vision Encoder Decoder model that uses the checkpoints of the [Swin Transformer](https://huggingface.co/microsoft/swin-base-patch4-window7-224)
as encoder and the checkpoints of [GPorTuguese-2](https://huggingface.co/pierreguillou/gpt2-small-portuguese) as decoder.
The encoder checkpoints come from the Swin Transformer version pre-trained on ImageNet-1k at resolution 224x224.
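
As a rough illustration of how such an encoder-decoder pair can be assembled with the `transformers` library, the sketch below combines the two checkpoints linked above. It is not the authors' training code (see the repository linked below for that). The function name `build_untrained_captioner` and the special-token choices are assumptions made for this example.

```python
ENCODER_ID = "microsoft/swin-base-patch4-window7-224"
DECODER_ID = "pierreguillou/gpt2-small-portuguese"


def build_untrained_captioner(encoder_id=ENCODER_ID, decoder_id=DECODER_ID):
    """Combine a vision encoder and a text decoder into one captioning model."""
    # Imported lazily so this module can be imported without loading transformers.
    from transformers import AutoTokenizer, VisionEncoderDecoderModel

    # The decoder gains cross-attention layers that are randomly initialized,
    # so the combined model must be fine-tuned before its captions are useful.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_id, decoder_id
    )
    tokenizer = AutoTokenizer.from_pretrained(decoder_id)

    # Assumption: GPT-2 style tokenizers may lack a pad token, so reuse EOS.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Assumption: start decoding from the tokenizer's BOS token.
    model.config.decoder_start_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, tokenizer
```

Calling `build_untrained_captioner()` downloads both checkpoints and returns a model that still needs fine-tuning on captioned images.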

The code used for training and evaluation is available at: https://github.com/laicsiifes/ved-transformer-caption-ptbr. In this work, Swin-GPorTuguese-2
was trained together with its buddy [Swin-DistilBERTimbau](https://huggingface.co/laicsiifes/swin-distilbert-flickr30k-pt-br). 

The other models evaluated, namely DeiT-BERTimbau, DeiT-DistilBERTimbau, DeiT-GPorTuguese-2, Swin-BERTimbau, ViT-BERTimbau,
ViT-DistilBERTimbau and ViT-GPorTuguese-2, did not perform as well as Swin-DistilBERTimbau and Swin-GPorTuguese-2.

## 🧑‍💻 How to Get Started with the Model

Use the code below to get started with the model.

```python
import requests
from PIL import Image

from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel

# load a fine-tuned image captioning model and corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained("laicsiifes/swin-gportuguese-2")
tokenizer = AutoTokenizer.from_pretrained("laicsiifes/swin-gportuguese-2")
image_processor = AutoImageProcessor.from_pretrained("laicsiifes/swin-gportuguese-2")

# download an example image and preprocess it (convert to RGB in case it is grayscale)
url = "http://images.cocodataset.org/val2014/COCO_val2014_000000458153.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = image_processor(image, return_tensors="pt").pixel_values

# generate the caption token ids and decode them into text
generated_ids = model.generate(pixel_values)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```

```python
import matplotlib.pyplot as plt

# plot image with caption
plt.imshow(image)
plt.axis("off")
plt.title(generated_text)
plt.show()
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/637a149c0dbdecf0b5bd6490/ih9NZRoAWfPXx2vXDgeSV.png)
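
To caption several images at once, the single-image steps above can be wrapped in a small helper. This is a minimal sketch: the function name `caption_images` and the default values for `max_new_tokens` and `num_beams` are illustrative assumptions, not settings taken from the original work.

```python
def caption_images(images, model, tokenizer, image_processor,
                   max_new_tokens=25, num_beams=3):
    """Return one caption per image, in the same order as `images`.

    `images` is a list of PIL images; `model`, `tokenizer` and
    `image_processor` are the objects loaded in the snippet above.
    """
    # Convert to RGB so grayscale or RGBA inputs do not break preprocessing.
    rgb_images = [img.convert("RGB") for img in images]

    # The image processor resizes and normalizes the whole batch at once.
    pixel_values = image_processor(rgb_images, return_tensors="pt").pixel_values

    # Beam search with a length cap; both values are assumptions for this sketch.
    generated_ids = model.generate(
        pixel_values, max_new_tokens=max_new_tokens, num_beams=num_beams
    )

    captions = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return [caption.strip() for caption in captions]
```

With the objects loaded earlier, `caption_images([image], model, tokenizer, image_processor)` returns a one-element list containing the caption.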

## 📈 Results

The evaluation metrics CIDEr-D, BLEU@4, ROUGE-L, METEOR and BERTScore
(using [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased)) are abbreviated as C, B@4, RL, M and BS, respectively.

|Model|Dataset|Eval. Split|C|B@4|RL|M|BS|
|:---:|:------:|:--------:|:-----:|:----:|:-----:|:----:|:-------:|
|Swin-DistilBERTimbau|Flickr30K Portuguese|test|66.73|24.65|39.98|44.71|72.30|
|Swin-GPorTuguese-2|Flickr30K Portuguese|test|64.71|23.15|39.39|44.36|71.70|
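
To give some intuition for the RL column, the sketch below computes a simplified ROUGE-L score between a generated caption and a single reference, based on their longest common subsequence (LCS). It assumes lowercased whitespace tokenization and a balanced F-measure; the evaluation code used for the table may tokenize, weight, or aggregate over multiple references differently, so this will not reproduce the reported numbers. The function names and example captions are made up for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: balanced F-measure over LCS precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical captions: all 5 candidate tokens appear in order in the
# 7-token reference, so precision is 1 and recall is 5/7.
score = rouge_l_f1(
    "um cachorro corre na grama",
    "um cachorro marrom corre na grama verde",
)
print(f"{100 * score:.2f}")
```

The score is multiplied by 100 here on the assumption that the table reports metrics on a 0 to 100 scale.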

## 📋 BibTeX entry and citation info

```bibtex
@inproceedings{bromonschenkel2024comparative,
  title={A Comparative Evaluation of Transformer-Based Vision Encoder-Decoder Models for Brazilian Portuguese Image Captioning},
  author={Bromonschenkel, Gabriel and Oliveira, Hil{\'a}rio and Paix{\~a}o, Thiago M},
  booktitle={2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)},
  pages={1--6},
  year={2024},
  organization={IEEE}
}
```