Fine-tuning BEiT-3 on Image Captioning
COCO Captioning Setup
- Set up the environment.
- Download the 2014 train images, 2014 val images, and the Karpathy split, then organize the dataset into the following structure (a scripted download sketch follows the layout below):
```
/path/to/your_data/
    train2014/
        COCO_train2014_000000000009.jpg
        ...
    val2014/
        COCO_val2014_000000000042.jpg
        ...
    dataset_coco.json
```
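If you prefer to script the download, here is a minimal sketch that fetches and unpacks the archives into the layout above. The archive URLs are assumptions based on the usual COCO and Karpathy-split distribution points, so verify them before running:

```python
import urllib.request
import zipfile
from pathlib import Path

data_path = Path("/path/to/your_data")
data_path.mkdir(parents=True, exist_ok=True)

# Assumed distribution URLs; confirm they are still current before running.
archives = {
    "train2014.zip": "http://images.cocodataset.org/zips/train2014.zip",
    "val2014.zip": "http://images.cocodataset.org/zips/val2014.zip",
    # The Karpathy split archive contains dataset_coco.json at its top level.
    "caption_datasets.zip": "https://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip",
}

for name, url in archives.items():
    archive = data_path / name
    if not archive.exists():
        print(f"downloading {url} ...")
        urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(data_path)  # yields train2014/, val2014/, dataset_coco.json
```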
We then generate the index JSON files using the following snippet. beit3.spm is the SentencePiece model used to tokenize text.
```python
from datasets import CaptioningDataset
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")

CaptioningDataset.make_coco_captioning_dataset_index(
    data_path="/path/to/your_data",
    tokenizer=tokenizer,
)
```
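As a quick sanity check that the SentencePiece model loads and tokenizes as expected, you can run the tokenizer on a sample caption (the caption string below is purely illustrative):

```python
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")

# Any string works here; this caption is illustrative.
caption = "a man riding a horse on the beach"
tokens = tokenizer.tokenize(caption)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(token_ids)
```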
NoCaps Setup
- Set up the environment.
- Download the NoCaps val set and NoCaps test set, download the images using the URLs in the val and test JSON files (see the sketch after the layout below), then organize the dataset into the following structure:
```
/path/to/your_data/
    val/
        09c863d76bcf6b00.jpg
        ...
    test/
        19dc6913830a0a21.jpg
        ...
    nocaps_val_4500_captions.json
    nocaps_test_image_info.json
```
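A minimal download sketch for the images, assuming the annotation files follow the COCO format, i.e. an `images` list whose entries carry `coco_url` and `file_name` fields; inspect the actual JSON files to confirm the field names before relying on them:

```python
import json
import urllib.request
from pathlib import Path

data_path = Path("/path/to/your_data")

# (annotation file, image directory) pairs from the layout above.
splits = [
    ("nocaps_val_4500_captions.json", "val"),
    ("nocaps_test_image_info.json", "test"),
]

for ann_name, split_dir in splits:
    out_dir = data_path / split_dir
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(data_path / ann_name) as f:
        annotations = json.load(f)
    for image in annotations["images"]:
        # "coco_url" and "file_name" are assumed COCO-style field names.
        target = out_dir / image["file_name"]
        if not target.exists():
            urllib.request.urlretrieve(image["coco_url"], target)
```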
We then generate the index JSON files using the following snippet. beit3.spm is the SentencePiece model used to tokenize text.
```python
from datasets import CaptioningDataset
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")

CaptioningDataset.make_nocaps_captioning_dataset_index(
    data_path="/path/to/your_data",
)
```
We use the COCO captioning training set as the training data for NoCaps.
Example: Fine-tuning BEiT-3 on Captioning
The BEiT-3 base model can be fine-tuned on captioning tasks using 8 V100-32GB GPUs:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
    --model beit3_base_patch16_480 \
    --input_size 480 \
    --task coco_captioning \
    --batch_size 32 \
    --layer_decay 1.0 \
    --lr 4e-5 \
    --randaug \
    --epochs 10 \
    --warmup_epochs 1 \
    --drop_path 0.1 \
    --sentencepiece_model /your_beit3_model_path/beit3.spm \
    --finetune /your_beit3_model_path/beit3_base_patch16_224.pth \
    --data_path /path/to/your_data \
    --output_dir /path/to/save/your_model \
    --log_dir /path/to/save/your_model/log \
    --weight_decay 0.05 \
    --seed 42 \
    --save_ckpt_freq 5 \
    --num_max_bpe_tokens 32 \
    --captioning_mask_prob 0.7 \
    --drop_worst_after 12000 \
    --dist_eval \
    --checkpoint_activations \
    --enable_deepspeed
```
- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8 * 32 = 256` (see the worked computation after this list).
- `--finetune`: path to your pretrained model weights; please download the pretrained weights listed in README.md.
- `--task`: `coco_captioning` for COCO captioning and `nocaps` for the NoCaps dataset.
- `--lr`: 4e-5 for COCO captioning and 1e-5 for NoCaps.
- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
- `--checkpoint_activations`: use gradient checkpointing to save GPU memory.
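For completeness, the effective batch size computation for the command above, assuming `--update_freq` defaults to 1 when not set (check `run_beit3_finetuning.py` for the actual default):

```python
num_gpus = 8      # --nproc_per_node
batch_size = 32   # --batch_size (per GPU)
update_freq = 1   # --update_freq (assumed default, not set above)

effective_batch_size = num_gpus * batch_size * update_freq
print(effective_batch_size)  # 256
```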
The BEiT-3 large model can be fine-tuned on captioning tasks using 8 V100-32GB GPUs:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
    --model beit3_large_patch16_480 \
    --input_size 480 \
    --task coco_captioning \
    --batch_size 32 \
    --layer_decay 1.0 \
    --lr 8e-6 \
    --randaug \
    --epochs 10 \
    --warmup_epochs 1 \
    --drop_path 0.1 \
    --sentencepiece_model /your_beit3_model_path/beit3.spm \
    --finetune /your_beit3_model_path/beit3_large_patch16_224.pth \
    --data_path /path/to/your_data \
    --output_dir /path/to/save/your_model \
    --log_dir /path/to/save/your_model/log \
    --weight_decay 0.05 \
    --seed 42 \
    --save_ckpt_freq 5 \
    --num_max_bpe_tokens 32 \
    --captioning_mask_prob 0.7 \
    --drop_worst_after 12000 \
    --dist_eval \
    --checkpoint_activations \
    --enable_deepspeed
```
- `--batch_size`: batch size per GPU. Effective batch size = `number of GPUs` * `--batch_size` * `--update_freq`. So in the above example, the effective batch size is `8 * 32 = 256`.
- `--finetune`: path to your pretrained model weights; please download the pretrained weights listed in README.md.
- `--task`: `coco_captioning` for COCO captioning and `nocaps` for the NoCaps dataset.
- `--lr`: 8e-6 for both COCO captioning and NoCaps.
- `--enable_deepspeed`: optional. If you use apex, please enable deepspeed.
- `--checkpoint_activations`: use gradient checkpointing to save GPU memory.
Example: Evaluate BEiT-3 Fine-tuned Model on Captioning
- Get the prediction file of the fine-tuned BEiT-3 base model on captioning with 8 V100-32GB GPUs:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
    --model beit3_base_patch16_480 \
    --input_size 480 \
    --task coco_captioning \
    --batch_size 16 \
    --sentencepiece_model /your_beit3_model_path/beit3.spm \
    --finetune /your_beit3_model_path/beit3_base_patch16_480_coco_captioning.pth \
    --data_path /path/to/your_data \
    --output_dir /path/to/save/your_prediction \
    --eval \
    --dist_eval
```
- `--task`: `coco_captioning` for COCO captioning and `nocaps` for the NoCaps dataset.
- `--finetune`: `beit3_base_patch16_480_coco_captioning.pth` for COCO captioning and `beit3_base_patch16_480_nocaps.pth` for the NoCaps dataset.

- Get the prediction file of the fine-tuned BEiT-3 large model on captioning with 8 V100-32GB GPUs:
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_beit3_finetuning.py \
    --model beit3_large_patch16_480 \
    --input_size 480 \
    --task coco_captioning \
    --batch_size 16 \
    --sentencepiece_model /your_beit3_model_path/beit3.spm \
    --finetune /your_beit3_model_path/beit3_large_patch16_480_coco_captioning.pth \
    --data_path /path/to/your_data \
    --output_dir /path/to/save/your_prediction \
    --eval \
    --dist_eval
```
- `--task`: `coco_captioning` for COCO captioning and `nocaps` for the NoCaps dataset.
- `--finetune`: `beit3_large_patch16_480_coco_captioning.pth` for COCO captioning and `beit3_large_patch16_480_nocaps.pth` for the NoCaps dataset.
Please then submit the prediction file in the `output_dir` to the evaluation server to obtain the NoCaps val and test results.
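Before submitting, it can help to sanity-check the prediction file. A minimal sketch, assuming the standard COCO-style submission format (a JSON list of records with `image_id` and `caption` fields); the file name below is hypothetical, so check `output_dir` for the actual name:

```python
import json
from pathlib import Path

# Hypothetical file name; use the actual prediction file in output_dir.
pred_file = Path("/path/to/save/your_prediction/predictions_nocaps_val.json")

with open(pred_file) as f:
    predictions = json.load(f)

# Expect a list of {"image_id": ..., "caption": ...} records,
# the standard COCO-style caption submission format.
assert isinstance(predictions, list)
for record in predictions[:5]:
    assert "image_id" in record and "caption" in record
    print(record["image_id"], "->", record["caption"])
```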