Model Summary
MEXMA-SigLIP2 combines the MEXMA multilingual text encoder with the image encoder from SigLIP2, yielding a high-performance CLIP-style model that covers 80 languages. It sets a new state of the art on the Crossmodal-3600 benchmark, with 62.54% R@1 for image retrieval and 59.99% R@1 for text retrieval.
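Under the hood this is a dual encoder: each tower maps its modality into a shared embedding space, and matching is a scaled cosine similarity. Below is a minimal, illustrative sketch of that standard CLIP-style scoring, with placeholder tensors and an illustrative temperature, not the repo's actual code:

import torch
import torch.nn.functional as F

text_emb = torch.randn(3, 768)    # placeholder text-tower outputs (num_texts, dim)
image_emb = torch.randn(1, 768)   # placeholder image-tower outputs (num_images, dim)
logit_scale = torch.tensor(10.0)  # learned temperature; value here is illustrative

# L2-normalize each embedding so the dot product is a cosine similarity,
# then scale it to produce the logits that a softmax turns into probabilities.
text_emb = F.normalize(text_emb, dim=-1)
image_emb = F.normalize(image_emb, dim=-1)
logits_per_image = logit_scale * image_emb @ text_emb.T  # (num_images, num_texts)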
How to use
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch
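
# Load the model, tokenizer, and image processor. trust_remote_code=True pulls in the
# repo's custom modeling code, which provides get_logits and the model-specific
# 'optimized' flag.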
model = AutoModel.from_pretrained("visheratin/mexma-siglip2", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip2")
processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip2")
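
# Fetch an example photo of the Eiffel Tower and preprocess it into a bfloat16
# pixel tensor on the GPU, matching the dtype the model was loaded with.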
img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
img = processor(images=img, return_tensors="pt")["pixel_values"]
img = img.to(torch.bfloat16).to("cuda")
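
# Compare the image against three captions in different languages:
# Russian "кошка" ("a cat"), English "a dog", and Hindi "एफिल टॉवर" ("Eiffel Tower").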
with torch.inference_mode():
    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    probs = image_logits.softmax(dim=-1)
    print(probs)
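The printed probabilities are a softmax over the three captions for the single image, so the Hindi caption for the Eiffel Tower should receive nearly all of the mass. The same call also works with a batch of images. Below is a hedged sketch of text-to-image retrieval, assuming get_logits follows the usual CLIP convention of returning image logits of shape (num_images, num_texts) and text logits of shape (num_texts, num_images); the file names are hypothetical:

images = [Image.open(p) for p in ["photo1.jpg", "photo2.jpg"]]  # hypothetical local files
pixel_values = processor(images=images, return_tensors="pt")["pixel_values"]
pixel_values = pixel_values.to(torch.bfloat16).to("cuda")
with torch.inference_mode():
    query = tokenizer(["एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
    image_logits, text_logits = model.get_logits(query["input_ids"], query["attention_mask"], pixel_values)
    # Softmax over the image axis ranks the images for the query caption.
    probs = text_logits.softmax(dim=-1)
    print(probs.argmax(dim=-1))  # index of the best-matching image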
Acknowledgements
I thank ML Collective for providing compute resources to train the model.
Evaluation results
- Image retrieval R@1 on Crossmodal-3600 (self-reported): 62.54%
- Text retrieval R@1 on Crossmodal-3600 (self-reported): 59.99%