
Robustness in Both Domains: CLIP Needs a Robust Text Encoder

Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein and Volkan Cevher

LIONS @ EPFL and Tübingen AI Center

In this repo, you will find all the models trained for our paper.

Loading CLIPModels

You can load our models like any other CLIP model. For example, LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2 can be loaded with the "openai/clip-vit-large-patch14" processor as in the following snippet:


from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model_name = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"
processor_name = "openai/clip-vit-large-patch14"

model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(processor_name)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
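
As a quick follow-up to the snippet above (the two label strings are simply the prompts from the example), the prediction can be inspected like this:

labels = ["a photo of a cat", "a photo of a dog"]
pred = probs.argmax(dim=1).item()  # index of the most likely prompt
print(f"predicted: {labels[pred]} (p={probs[0, pred].item():.3f})")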

When loading other model sizes, the processor_name needs to be changed accordingly (a short loading sketch follows the table):

Model Size Processor Name
ViT-L-14 "openai/clip-vit-large-patch14"
ViT-H-14 "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
ViT-g-14 "laion/CLIP-ViT-g-14-laion2B-s12B-b42K"
ViT-bigG-14 "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"
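
For instance, here is a minimal sketch of this pairing; the PROCESSOR_BY_SIZE dictionary below is just a convenience mapping built from the table, not part of the released code:

from transformers import CLIPModel, CLIPProcessor

# Processor checkpoint for each model size (taken from the table above)
PROCESSOR_BY_SIZE = {
    "ViT-L-14": "openai/clip-vit-large-patch14",
    "ViT-H-14": "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
    "ViT-g-14": "laion/CLIP-ViT-g-14-laion2B-s12B-b42K",
    "ViT-bigG-14": "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k",
}

model_name = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"  # a ViT-L-14 checkpoint
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(PROCESSOR_BY_SIZE["ViT-L-14"])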

Loading CLIPTextModels

If you just need the text encoder, you can load it with the following snippet:

from transformers import CLIPTokenizer, CLIPTextModel

model_name = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"
processor_name = "openai/clip-vit-large-patch14"

model = CLIPTextModel.from_pretrained(model_name)
tokenizer = CLIPTokenizer.from_pretrained(processor_name)

inputs = tokenizer(["a photo of a cat", "a photo of a dog"],  padding=True, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output # pooled (EOS token) states
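
As a small usage sketch (not part of the original snippet), the pooled states can be compared directly, for example via cosine similarity; note that these are the encoder's pooled hidden states, before CLIP's text projection:

import torch.nn.functional as F

# Cosine similarity between the two prompts' pooled (EOS token) states
emb = F.normalize(pooled_output, dim=-1)
similarity = emb @ emb.T
print(similarity)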

Acknowledgements

Our codebase builds on the OpenCLIP codebase; we appreciate the effort of the OpenCLIP team and the release of their code and model weights.
