(The following contents are from [the ViLT repo](https://github.com/dandelin/ViLT/blob/master/DATA.md).)

# Dataset Preparation
We use seven datasets: Google Conceptual Captions (GCC), Stony Brook University Captions (SBU), Visual Genome (VG), COCO Captions (COCO), Flickr 30K Captions (F30K), Visual Question Answering v2 (VQAv2), and Natural Language for Visual Reasoning 2 (NLVR2).

We do not distribute the datasets because of licensing issues; please download them yourself.
We use `pyarrow` to serialize the datasets; the conversion scripts are located in `vlmo/utils/write_*.py`.
Organize each dataset as described below, then run the corresponding `make_arrow` function to convert it into a pyarrow binary file.
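
After conversion, you can sanity-check an output file with plain `pyarrow`. This is a minimal sketch; the file name below is only an example, since the actual names depend on which `write_*` script you ran.

```python
import pyarrow as pa

# Memory-map a generated file and read it back as a Table.
table = pa.ipc.RecordBatchFileReader(
    pa.memory_map("arrows_root/coco_caption_karpathy_train.arrow", "r")
).read_all()
print(table.num_rows, table.schema.names)
```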

## GCC
https://ai.google.com/research/ConceptualCaptions/download

GCC provides (image URL, caption) tuples; note that a sizable portion of the URLs are no longer accessible.
Write your own download script and organize the dataset in the following structure; a minimal downloader sketch follows the tree.

    root
    β”œβ”€β”€ images_train            
    β”‚   β”œβ”€β”€ 0000                # First four letters of image name
    β”‚   β”‚   β”œβ”€β”€ 0000000         # Image Binary
    β”‚   β”‚   β”œβ”€β”€ 0000001      
    β”‚   β”‚   └── ...
    β”‚   β”œβ”€β”€ 0001              
    β”‚   β”‚   β”œβ”€β”€ 0001000      
    β”‚   β”‚   β”œβ”€β”€ 0001001      
    β”‚   β”‚   └── ...          
    β”‚   └── ...          
    β”œβ”€β”€ images_val          
    β”‚   β”œβ”€β”€ 0000              
    β”‚   β”‚   └── ...
    β”‚   └── ...          
    β”œβ”€β”€ train_annot.json        # List of (image_file_path, caption) tuples
    └── val_annot.json          # List of (image_file_path, caption) tuples
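
The download logic is up to you; below is a minimal single-threaded sketch of one possible approach. The TSV file name and its caption-then-URL column order follow the official download page, and whether the paths stored in `train_annot.json` should be absolute or relative to `root` depends on the conversion script, so verify both against your copy and against `vlmo/utils/write_conceptual_caption.py`. In practice you will also want parallel downloads.

```python
import json
import os

import requests


def download_split(tsv_path, image_dir, annot_path):
    """Download images from a GCC TSV with one caption <tab> URL per line."""
    annotations = []
    with open(tsv_path, encoding="utf-8") as f:
        for index, line in enumerate(f):
            caption, url = line.rstrip("\n").split("\t")
            name = f"{index:07d}"                       # e.g. 0001234
            subdir = os.path.join(image_dir, name[:4])  # first four digits
            os.makedirs(subdir, exist_ok=True)
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
            except requests.RequestException:
                continue  # many URLs are dead; skip them
            image_path = os.path.join(subdir, name)
            with open(image_path, "wb") as img:
                img.write(response.content)
            annotations.append((image_path, caption))
    with open(annot_path, "w", encoding="utf-8") as f:
        json.dump(annotations, f)


# The TSV name follows the official release; adjust if yours differs.
download_split("Train_GCC-training.tsv", "root/images_train", "root/train_annot.json")
```

Once the images and annotation files are in place, convert them: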

```python
from vlmo.utils.write_conceptual_caption import make_arrow
make_arrow(root, arrows_root)
```

## SBU
http://www.cs.virginia.edu/~vicente/sbucaptions/

Like GCC, SBU provides (image URL, caption) tuples, and likewise many of the URLs are no longer accessible.
Write your own download script and organize the dataset in the following structure; a note on adapting the GCC sketch follows the tree.

    root
    β”œβ”€β”€ images_train            
    β”‚   β”œβ”€β”€ 0000                # First four letters of image name
    β”‚   β”‚   β”œβ”€β”€ 0000000         # Image Binary
    β”‚   β”‚   β”œβ”€β”€ 0000001      
    β”‚   β”‚   └── ...
    β”‚   β”œβ”€β”€ 0001              
    β”‚   β”‚   β”œβ”€β”€ 0001000      
    β”‚   β”‚   β”œβ”€β”€ 0001001      
    β”‚   β”‚   └── ...          
    β”‚   └── ...          
    └── annot.json              # List of (image_file_path, caption) tuples
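
SBU ships its URLs and captions as two parallel text files rather than a TSV, so the GCC sketch above needs only a different input-reading step. The file names below match the official release, but verify them against your copy.

```python
# Read the parallel URL and caption files into (url, caption) pairs.
with open("SBU_captioned_photo_dataset_urls.txt", encoding="utf-8") as f_urls, \
     open("SBU_captioned_photo_dataset_captions.txt", encoding="utf-8") as f_caps:
    pairs = list(zip(f_urls.read().splitlines(), f_caps.read().splitlines()))

# Download each pair into images_train/<first four digits>/<seven-digit index>
# as in the GCC sketch, then dump the (image_file_path, caption) list to
# root/annot.json.
```

Then convert as before: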

```python
from vlmo.utils.write_sbu import make_arrow
make_arrow(root, arrows_root)
```

## VG
http://visualgenome.org/api/v0/api_home.html

Download [image part 1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [image part 2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip), and [region descriptions](http://visualgenome.org/static/data/dataset/region_descriptions.json.zip).

    root
    β”œβ”€β”€ images            
    β”‚   β”œβ”€β”€ VG_100K                  
    β”‚   β”‚   β”œβ”€β”€ 10.jpg        
    β”‚   β”‚   β”œβ”€β”€ 107899.jpg      
    β”‚   β”‚   └── ...
    β”‚   β”œβ”€β”€ VG_100K_2              
    β”‚   β”‚   β”œβ”€β”€ 1.jpg      
    β”‚   β”‚   β”œβ”€β”€ 100.jpg      
    β”‚   β”‚   └── ...          
    β”‚   └── ...          
    └── annotations         
        └── region_descriptions.json

```python
from vlmo.utils.write_vg import make_arrow
make_arrow(root, arrows_root)
```

## COCO
https://cocodataset.org/#download

Download [2014 train images](http://images.cocodataset.org/zips/train2014.zip), [2014 val images](http://images.cocodataset.org/zips/val2014.zip), and the [Karpathy split](https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip); a snippet for inspecting the split file follows the tree.

    root
    β”œβ”€β”€ train2014            
    β”‚   β”œβ”€β”€ COCO_train2014_000000000009.jpg                
    |   └── ...
    β”œβ”€β”€ val2014              
    |   β”œβ”€β”€ COCO_val2014_000000000042.jpg
    |   └── ...          
    └── karpathy
        └── dataset_coco.json
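
If you want to inspect the split file before converting: the Karpathy JSON (as distributed in `caption_datasets.zip`) is a dict whose `images` list holds one entry per image. The keys below are the common ones; double-check them against your copy.

```python
import json

with open("root/karpathy/dataset_coco.json", encoding="utf-8") as f:
    data = json.load(f)

entry = data["images"][0]
# Typical keys: "filepath" ("train2014" or "val2014"), "filename",
# "split" ("train", "val", "test", or "restval"), and "sentences".
print(entry["filepath"], entry["filename"], entry["split"])
print(entry["sentences"][0]["raw"])  # one of the reference captions
```

Then convert: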

```python
from vlmo.utils.write_coco_karpathy import make_arrow
make_arrow(root, arrows_root)
```

## F30K
http://bryanplummer.com/Flickr30kEntities/

Sign the [Flickr images request form](https://forms.illinois.edu/sec/229675) and download the [Karpathy split](https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip).

    root
    β”œβ”€β”€ flickr30k-images            
    β”‚   β”œβ”€β”€ 1000092795.jpg
    |   └── ...
    └── karpathy
        └── dataset_flickr30k.json

```python
from vlmo.utils.write_f30k_karpathy import make_arrow
make_arrow(root, arrows_root)
```

## VQAv2
https://visualqa.org/download.html

Download COCO [2014 train images](http://images.cocodataset.org/zips/train2014.zip), [2014 val images](http://images.cocodataset.org/zips/val2014.zip), [2015 test images](http://images.cocodataset.org/zips/test2015.zip), annotations ([train](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip), [val](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip)), and questions ([train](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip), [val](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip), [test](https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Test_mscoco.zip)).

    root
    β”œβ”€β”€ train2014            
    β”‚   β”œβ”€β”€ COCO_train2014_000000000009.jpg                
    |   └── ...
    β”œβ”€β”€ val2014              
    |   β”œβ”€β”€ COCO_val2014_000000000042.jpg
    |   └── ...  
    β”œβ”€β”€ test2015              
    |   β”œβ”€β”€ COCO_test2015_000000000001.jpg
    |   └── ...         
    β”œβ”€β”€ v2_OpenEnded_mscoco_train2014_questions.json
    β”œβ”€β”€ v2_OpenEnded_mscoco_val2014_questions.json
    β”œβ”€β”€ v2_OpenEnded_mscoco_test2015_questions.json
    β”œβ”€β”€ v2_OpenEnded_mscoco_test-dev2015_questions.json
    β”œβ”€β”€ v2_mscoco_train2014_annotations.json
    └── v2_mscoco_val2014_annotations.json

```python
from vlmo.utils.write_vqa import make_arrow
make_arrow(root, arrows_root)
```

## NLVR2
Clone the [repository](https://github.com/lil-lab/nlvr) and sign the [request form](https://goo.gl/forms/yS29stWnFWzrDBFH3) to download the images; a snippet for reading the annotations follows the tree.

    root
    β”œβ”€β”€ images/train           
    β”‚   β”œβ”€β”€ 0                  
    β”‚   β”‚   β”œβ”€β”€ train-10108-0-img0.png   
    β”‚   β”‚   └── ...
    β”‚   β”œβ”€β”€ 1                  
    β”‚   β”‚   β”œβ”€β”€ train-10056-0-img0.png       
    β”‚   β”‚   └── ...
    β”‚   └── ...
    β”œβ”€β”€ dev       
    β”‚   β”œβ”€β”€ dev-0-0-img0.png
    |   └── ...
    β”œβ”€β”€ test1     
    β”‚   β”œβ”€β”€ test1-0-0-img0.png
    |   └── ...
    β”œβ”€β”€ nlvr
    β”œβ”€β”€ nlvr2
    └── README.md
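
The statements themselves live in the cloned repository as JSON Lines files. The path and key names below follow the NLVR2 release, but verify them against your checkout.

```python
import json

# Peek at one training example from the cloned repository.
with open("root/nlvr2/data/train.json", encoding="utf-8") as f:
    example = json.loads(f.readline())

print(example["sentence"], example["label"])
# The paired image names drop the identifier's trailing sentence index and
# append -img0/-img1, e.g. train-10108-0-img0.png and train-10108-0-img1.png.
base = "-".join(example["identifier"].split("-")[:-1])
print(f"{base}-img0.png", f"{base}-img1.png")
```

Then convert: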

```python
from vlmo.utils.write_nlvr2 import make_arrow
make_arrow(root, arrows_root)
```

## WikiBK (text-only data)
```python
from vlmo.utils.write_wikibk import make_arrow
make_arrow(root, arrows_root)
```