---
title: README
emoji: 📚
colorFrom: pink
colorTo: indigo
sdk: static
pinned: false
---
# Robustness in Both Domains: CLIP Needs a Robust Text Encoder

Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein and Volkan Cevher

LIONS @ EPFL and Tübingen AI Center
In this repo, you will find all the models trained for our paper.
## Loading CLIPModels

You can load our models like any other CLIP model. For example, `LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2` can be loaded by adapting the `openai/clip-vit-large-patch14` example snippet:
```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model_name = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"
processor_name = "openai/clip-vit-large-patch14"

model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(processor_name)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
```
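If you prefer to work with embeddings in the shared image-text space rather than similarity logits, `CLIPModel` also exposes `get_image_features` and `get_text_features`. A minimal sketch, reusing `model`, `processor` and `inputs` from the snippet above:

```python
import torch

# Projected embeddings in the shared image-text space
with torch.no_grad():
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])

# Normalize so that dot products correspond to cosine similarities
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = image_embeds @ text_embeds.T
```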
When loading other model sizes, the `processor_name` needs to be changed accordingly (see the sketch after the table):
| Model Size | Processor Name |
|---|---|
| ViT-L-14 | `"openai/clip-vit-large-patch14"` |
| ViT-H-14 | `"laion/CLIP-ViT-H-14-laion2B-s32B-b79K"` |
| ViT-g-14 | `"laion/CLIP-ViT-g-14-laion2B-s12B-b42K"` |
| ViT-bigG-14 | `"laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"` |
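As a small sketch, the size-to-processor mapping from the table can be kept in a dictionary and looked up when loading a checkpoint (substitute `model_name` and `model_size` with the checkpoint you actually want to load):

```python
from transformers import CLIPModel, CLIPProcessor

# Processor checkpoint for each model size (see the table above)
PROCESSOR_NAMES = {
    "ViT-L-14": "openai/clip-vit-large-patch14",
    "ViT-H-14": "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
    "ViT-g-14": "laion/CLIP-ViT-g-14-laion2B-s12B-b42K",
    "ViT-bigG-14": "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k",
}

model_size = "ViT-L-14"  # change to the size of the checkpoint you load
model_name = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"

model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(PROCESSOR_NAMES[model_size])
```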
## Loading CLIPTextModels

If you only need the text encoder, you can load it with the following snippet:
```python
from transformers import CLIPTokenizer, CLIPTextModel

model_name = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"
processor_name = "openai/clip-vit-large-patch14"

model = CLIPTextModel.from_pretrained(model_name)
tokenizer = CLIPTokenizer.from_pretrained(processor_name)

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output  # pooled (EOS token) states
```
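If you need text embeddings in the joint CLIP embedding space (i.e. with the text projection applied), the same checkpoint can be loaded with `CLIPTextModelWithProjection` instead. A minimal sketch, assuming the checkpoint ships the text projection weights as full CLIP checkpoints do:

```python
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

model_name = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"
processor_name = "openai/clip-vit-large-patch14"

model = CLIPTextModelWithProjection.from_pretrained(model_name)
tokenizer = CLIPTokenizer.from_pretrained(processor_name)

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
outputs = model(**inputs)
text_embeds = outputs.text_embeds  # projected text embeddings in the shared image-text space
```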
## Acknowledgements

Our codebase is based on the OpenCLIP codebase; we appreciate the effort of the OpenCLIP team and their release of code and model weights.