
LEAF
Robustness in Both Domains: CLIP Needs a Robust Text Encoder
Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein and Volkan Cevher
LIONS @ EPFL and Tübingen AI Center
In this repo, you will find all the models trained for our paper.
Loading CLIPModels
You can load our models like any other CLIP model. For example, LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2 can be loaded by following the "openai/clip-vit-large-patch14" example snippet:
```python
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

model_name = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"
processor_name = "openai/clip-vit-large-patch14"

model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(processor_name)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
```
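If you only need the embeddings themselves (e.g., for retrieval), the same CLIPModel also exposes get_image_features and get_text_features. A minimal sketch, reusing the model, processor and inputs from the snippet above:

```python
import torch

# Encode image and text into the shared embedding space (no gradients needed for inference)
with torch.no_grad():
    image_features = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_features = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Normalize and compute cosine similarities; this matches logits_per_image up to the learned scale
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = image_features @ text_features.T
```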
When loading other model sizes, processor_name needs to be changed accordingly:
| Model Size | Processor Name |
|---|---|
| ViT-L-14 | "openai/clip-vit-large-patch14" |
| ViT-H-14 | "laion/CLIP-ViT-H-14-laion2B-s32B-b79K" |
| ViT-g-14 | "laion/CLIP-ViT-g-14-laion2B-s12B-b42K" |
| ViT-bigG-14 | "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k" |
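For instance, the ViT-H checkpoint should load with the matching LAION processor (a sketch following the same pattern as above; the model name is one of the checkpoints listed below):

```python
from transformers import CLIPProcessor, CLIPModel

model_name = "LEAF-CLIP/OpenCLIP-ViT-H-rho50-k1-constrained-FARE2"
processor_name = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"

model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(processor_name)
```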
Loading CLIPTextModels
If you just need the text encoder, you can load it with the following snippet:
```python
from transformers import CLIPTokenizer, CLIPTextModel

model_name = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"
processor_name = "openai/clip-vit-large-patch14"

model = CLIPTextModel.from_pretrained(model_name)
tokenizer = CLIPTokenizer.from_pretrained(processor_name)

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output  # pooled (EOS token) states
```
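Note that pooler_output is the pooled EOS-token state before CLIP's text projection. If you want text embeddings in the joint image-text space, one option (a sketch, not part of the original snippet) is transformers' CLIPTextModelWithProjection:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

model_name = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
model = CLIPTextModelWithProjection.from_pretrained(model_name)

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

text_embeds = outputs.text_embeds  # projected text embeddings in the shared CLIP space
```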
Acknowledgements
Our codebase is based on the OpenCLIP codebase. We appreciate the effort of the OpenCLIP team and the release of their code and model weights.
Models
- LEAF-CLIP/OpenCLIP-ViT-g-rho50-k1-constrained-FARE2
- LEAF-CLIP/OpenCLIP-ViT-bigG-rho50-k1-constrained
- LEAF-CLIP/OpenCLIP-ViT-H-rho50-k1-constrained-FARE2
- LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2
- LEAF-CLIP/OpenCLIP-ViT-g-rho50-k1-constrained
- LEAF-CLIP/OpenCLIP-ViT-g-FARE2
- LEAF-CLIP/OpenCLIP-ViT-g-rho50-k1-FARE2
- LEAF-CLIP/OpenCLIP-ViT-g-rho50-k1
- LEAF-CLIP/OpenCLIP-ViT-g
- LEAF-CLIP/OpenCLIP-ViT-bigG-rho50-k1