ethzanalytics/blip2-flan-t5-xl-sharded

Sharded BLIP-2 Model Card - flan-t5-xl

This is a sharded version of blip2-flan-t5-xl, which leverages Flan-T5-XL for image-to-text tasks such as image captioning and visual question answering.

This model repo is sharded so that it can be loaded easily on low-RAM Colab runtimes :)
Refer to the original model card for more details about the model description, intended uses, and limitations, as well as instructions for how to use the model on CPU and GPU in different precisions.
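
To illustrate why sharding helps on low-RAM runtimes, here is a minimal sketch of splitting a checkpoint into size-limited shards, so that only one shard needs to be read into memory at a time. This is a simplified greedy illustration, not the actual transformers implementation; the function name `plan_shards` and the tensor names and sizes are hypothetical. In transformers itself, the shard size is controlled by the `max_shard_size` argument of `save_pretrained`; the value used for this repo is not stated here.

```python
def plan_shards(tensor_sizes, max_shard_bytes):
    """Greedily group tensors, in order, into shards of at most max_shard_bytes.

    Simplified illustration only: a tensor larger than the limit still gets
    a shard of its own, so that shard may exceed the limit.
    """
    shards, current, used = [], [], 0
    for name, size in tensor_sizes.items():
        # Close the current shard if adding this tensor would overflow it.
        if current and used + size > max_shard_bytes:
            shards.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        shards.append(current)
    return shards


# Hypothetical tensor names and sizes (in bytes), for illustration only.
sizes = {
    "encoder.w": 600,
    "encoder.b": 100,
    "qformer.w": 500,
    "decoder.w": 900,
    "decoder.b": 50,
}
for i, names in enumerate(plan_shards(sizes, max_shard_bytes=1000), start=1):
    print(f"shard {i}: {names}")
```

With smaller shards, peak memory while loading is bounded roughly by the model being built plus one shard, rather than the model plus a full copy of the checkpoint.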

Usage

Refer to the original model card for details or see this blog post. Here is how you can use it on CPU:

Install

Requires the current main branch of transformers (at the time of writing):

pip install accelerate git+https://github.com/huggingface/transformers.git -U -q
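
After installing, a quick check like the following can confirm that the installed transformers is recent enough. This is a sketch under one assumption: that 4.27.0 is the first release that ships BLIP-2. The helper names `parse_version` and `supports_blip2` are made up for this example, and development builds such as `4.27.0.dev0` are treated the same as the release they precede.

```python
from importlib import metadata

# Assumption: 4.27.0 is the first transformers release that includes BLIP-2.
MIN_VERSION = (4, 27, 0)


def parse_version(text):
    """Turn '4.27.0.dev0' into (4, 27, 0), ignoring any non-numeric suffix."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_blip2(version_text, minimum=MIN_VERSION):
    """Return True if the given version string is at least `minimum`."""
    return parse_version(version_text) >= minimum


try:
    installed = metadata.version("transformers")
    status = "OK" if supports_blip2(installed) else "too old for BLIP-2"
    print(f"transformers {installed}: {status}")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```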

Use (this is for CPU; check out the original model card/blog for fp16 and int8 usage)

import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and the sharded model weights on CPU
model_name = "ethzanalytics/blip2-flan-t5-xl-sharded"
processor = Blip2Processor.from_pretrained(model_name)
model = Blip2ForConditionalGeneration.from_pretrained(model_name)

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Pair the image with a question for visual question answering
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

# Generate the answer tokens and decode them to text
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
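
If you want to ask several questions about the same image, the steps above can be wrapped in a small helper. This is only a convenience sketch: the function name `answer_question` and the default of 20 new tokens are choices made for this example, not something taken from the original model card. It relies only on the `processor(...)`, `model.generate(...)`, and `processor.decode(...)` calls already used above.

```python
def answer_question(model, processor, image, question, max_new_tokens=20):
    """Run one visual-question-answering pass and return the decoded answer.

    `model` and `processor` are expected to behave like the objects loaded
    in the usage snippet; `max_new_tokens` caps the length of the answer.
    """
    # Encode the image together with the question text.
    inputs = processor(image, question, return_tensors="pt")
    # Generate answer tokens, limited to max_new_tokens.
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the first (and only) sequence back to plain text.
    return processor.decode(out[0], skip_special_tokens=True).strip()
```

With `model`, `processor`, and `raw_image` from the snippet above, `answer_question(model, processor, raw_image, "how many dogs are in the picture?")` returns the decoded answer as a string.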