TLDR: We trained a Flamingo model with Llama2-Chat7B as the LLM on CC3M in less than 5 hours using just 4 A100s.

The model shows promising zero-shot captioning skills. High-quality captioning data really helps with fast alignment.

You can test it with the following code. Be sure to visit Otter to get the necessary Flamingo/Otter models.

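from flamingo.modeling_flamingo import FlamingoForConditionalGeneration
# device_map must be the string "auto" so the weights are placed on the available devices
flamingo_model = FlamingoForConditionalGeneration.from_pretrained("luodian/Flamingo-Llama2-Chat7B-CC3M", device_map="auto")
# Prompt with a caption prefix for the model to continue
prompt = "<image>an image of"
# Bare image token with no text prefix
simple_prompt = "<image>"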
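
The snippet above only loads the model and defines the prompts. A minimal sketch of how a caption might then be generated is given below. It assumes the Otter codebase is on the Python path, that the model exposes its tokenizer as `text_tokenizer`, and that `generate` accepts `vision_x` and `lang_x` as in the Otter examples. The helper names `build_prompt` and `caption_image` are hypothetical and not part of the released code.

```python
def build_prompt(caption_prefix: str = "an image of") -> str:
    """Return the <image> token followed by an optional text prefix."""
    return "<image>" + caption_prefix


def caption_image(image_path: str, prompt: str, max_new_tokens: int = 32) -> str:
    """Hypothetical helper: caption one image with the released checkpoint."""
    # Heavy dependencies are imported lazily so build_prompt works without them.
    import torch
    from PIL import Image
    from transformers import CLIPImageProcessor
    # Provided by the Otter repository, not by a pip package of this name.
    from flamingo.modeling_flamingo import FlamingoForConditionalGeneration

    model = FlamingoForConditionalGeneration.from_pretrained(
        "luodian/Flamingo-Llama2-Chat7B-CC3M", device_map="auto"
    )
    # Assumption: Otter's Flamingo class exposes its tokenizer under this name.
    tokenizer = model.text_tokenizer
    image_processor = CLIPImageProcessor()

    image = Image.open(image_path).convert("RGB")
    pixels = image_processor.preprocess([image], return_tensors="pt")["pixel_values"]
    # Add the image-count and frame axes: (batch, images, frames, C, H, W).
    vision_x = pixels.unsqueeze(1).unsqueeze(0)
    lang_x = tokenizer([prompt], return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            vision_x=vision_x.to(model.device),
            lang_x=lang_x["input_ids"].to(model.device),
            attention_mask=lang_x["attention_mask"].to(model.device),
            max_new_tokens=max_new_tokens,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `caption_image("example.jpg", build_prompt())` would use the captioning prompt, while `build_prompt("")` yields the bare `<image>` prompt. The image path here is a placeholder.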