MIEB: The Benchmark That Stress-Tests Image-Text Embeddings Like Never Before

Community Article · Published April 24, 2025

Key Takeaways:

  1. MIEB redefines how we evaluate image and image-text embeddings by moving beyond narrow, task-specific benchmarks to a comprehensive and unified framework.
  2. Quick and easy to integrate: you can run MIEB in 2 lines of code!

How good is your image embedding model—really?

It might retrieve cats like a champ or cluster satellite images beautifully. But does it understand documents? Can it handle questions grounded in vision? What about performance across 38 languages?

Turns out, no one really knew—until now.

Introducing MIEB: the Massive Image Embedding Benchmark (paper), the most ambitious attempt yet to bring clarity to the chaos of evaluating vision and multimodal models. Think of it as the MTEB moment for image embeddings, across 130 diverse tasks.


The Fragmented World of Image Embeddings

Here’s the problem MIEB solves:

Image and image-text models have been evaluated using task-specific, disconnected benchmarks—some built for clustering, others for zero-shot classification, others still for multimodal retrieval. There was no standardized way to answer the simple question:

“Which model is actually good overall?”

The lack of consistency made it impossible to compare models, track progress, or even know what “good” meant outside a narrow use case.

What MIEB Actually Measures

MIEB fills that void with a unified, multilingual, multimodal benchmark spanning 8 broad task categories, each probing specific model capabilities:

  • Retrieval: Retrieval is core to search, recommendation, and multimodal assistants. MIEB expands beyond standard image-image and image-text matching to include multilingual, interleaved inputs, and retrieval with instructions — areas neglected by existing benchmarks.

  • Document Understanding (OCR + Layout): Vision models often crumble when faced with real-world documents—think receipts, forms, dense PDFs. MIEB evaluates whether models can interpret high-res, text-heavy images, bridging the gap between image recognition and text comprehension. Typical vision benchmarks barely touched this until Vidore (Faysse et al., 2024).

  • Visual STS (Semantic Similarity for Rendered Text): Can your model understand the meaning of text when it’s visually rendered? MIEB adapts the classic STS benchmark from NLP, turning it into a vision test by rendering sentences as images. This reveals how well models handle text understanding via visual inputs—a gap most benchmarks overlook entirely.

  • Zero-Shot Classification: In the wild, models often face new classes with no labels. MIEB tests prompt-based recognition without training—a challenge that exposes whether models can match semantics and visuals with no extra supervision (see the first sketch after this list).

  • Few-Shot Linear Probing: Evaluates how much structured knowledge is embedded in the representations and can be extracted with minimal data. Instead of fine-tuning on massive datasets, MIEB probes embeddings using just 16 examples per class, making it efficient and fair across models (see the second sketch after this list).

  • Clustering: Good embeddings should naturally group similar items—even without labels. MIEB tests the shape of the embedding space with metrics like NMI, uncovering how well models capture semantic structure.

  • Compositionality Evaluation: This assesses whether the composition of a set of visual and textual elements aligns meaningfully, capturing relationships between objects, attributes, and spatial configurations. MIEB’s compositionality evaluation tests how well embeddings encode fine-grained structure, which is often lost in coarser tasks like clustering or retrieval, by challenging models to distinguish true alignments from closely perturbed mismatches.

  • Vision-Centric Question Answering (VCQA): We need models that can answer questions based on images, not just captions. MIEB frames VCQA as a retrieval problem, where models must select the right answer given visual context. It goes beyond simple “what is this?” questions to test counting, spatial reasoning, and real visual understanding.
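
To make the zero-shot classification setup concrete, here is a minimal sketch of the idea: embed a set of class prompts and an image, then score the image against each prompt and pick the best match. It calls Hugging Face's CLIP directly on a placeholder image, so it illustrates the principle rather than MIEB's internal evaluation code.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustration only: a plain gray image stands in for a real dataset sample.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.new("RGB", (224, 224), color="gray")
prompts = [f"a photo of a {label}" for label in ["cat", "dog", "airplane"]]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarities; the highest one wins.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs.squeeze().tolist())))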

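Similarly, here is a minimal sketch of the few-shot linear probing protocol: fit a linear classifier on 16 frozen embeddings per class and evaluate on the rest. Random vectors stand in for real image embeddings, and details such as the classifier settings are assumptions for this sketch, not MIEB's exact configuration.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_classes, dim, shots = 10, 512, 16

# Pretend these are frozen image embeddings with their class labels.
X = rng.normal(size=(num_classes * 100, dim))
y = np.repeat(np.arange(num_classes), 100)

# Sample 16 training examples per class; everything else becomes the test set.
train_idx = np.concatenate([rng.choice(np.where(y == c)[0], shots, replace=False)
                            for c in range(num_classes)])
test_mask = np.ones(len(y), dtype=bool)
test_mask[train_idx] = False

clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
print("probe accuracy:", clf.score(X[test_mask], y[test_mask]))
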
All of this across 130 tasks, in 38 languages.

50 embedding models, including multimodal large language models (MLLMs) and CLIP-style models, were evaluated across MIEB tasks.

Spoiler: There’s No Supermodel (Yet)

One of MIEB’s most important findings? No single model dominates.

CLIP-style models excel at traditional tasks like classification and retrieval. These models are trained with huge batch sizes on massive amounts of web image-text pairs.

MLLM-based models (like E5-V and Voyage) shine in document understanding, OCR-heavy tasks, and multilingual settings. These models leverage their strong LLM backbones for reasoning: during the generative training stage, the LLM backbone learns to use interleaved information provided by vision encoders (e.g., through training on OCR tasks), and these abilities are then activated through lightweight contrastive learning.

But every model has blind spots—especially when it comes to reasoning, interleaved embeddings, or confounders. A promising path suggested by benchmarking 50+ models on MIEB is to take the best of both worlds: combine the training paradigm of CLIP-style models with the inherent reasoning and interleaved-input processing abilities of MLLMs.

Why MIEB Matters Now

We’re in the age of foundation models. But foundations need pressure-testing. MIEB isn’t just another benchmark—it’s a stress test for the next generation of vision models. It reveals hidden strengths, uncovers brittle weak points, and offers something the field desperately needs: clarity.

GPU poor? We've got you covered.

Covering only 51 tasks, MIEB-lite is a lightweight benchmark that requires as little as 18% of the GPU-hours of the full MIEB while still preserving model rankings. For instance, running CLIP base-patch-32 on the full MIEB takes 16.6 hours on an NVIDIA H100, while MIEB-lite takes only 4.5 hours. MIEB-lite is produced by:

  1. Discarding highly correlated tasks based on pairwise task correlations (a rough sketch follows this list),
  2. Balancing selections from UMAP+DBSCAN task clusters, and
  3. Keeping only the lightweight tasks.
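
As a rough illustration of the first step, one simple approach is to compute pairwise rank correlations of model scores across tasks and greedily drop any task that correlates too strongly with a task already kept. The score matrix below is random dummy data, and the 0.95 threshold is an arbitrary choice for this sketch, not MIEB-lite's exact procedure.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
scores = rng.random((50, 130))  # 50 models x 130 tasks (dummy data)

corr, _ = spearmanr(scores)  # task-task rank correlations over model scores
keep, threshold = [], 0.95
for t in range(scores.shape[1]):
    # Keep a task only if it is not near-duplicated by an already-kept task.
    if all(abs(corr[t, k]) < threshold for k in keep):
        keep.append(t)
print(f"{len(keep)} tasks kept out of {scores.shape[1]}")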

How to use MIEB in your work

MIEB is integrated into the MTEB library. If you’re already familiar with how MTEB works, you can run any MIEB task and model the same way!

MIEB is also easily extensible to support custom tasks and custom model implementations.

🛠️ Run MIEB in 2 lines via CLI

You can run MIEB via the MTEB CLI.

First, install mteb with image support:

pip install mteb[image]

Then, run the benchmark with a selected model:

mteb run -b "MIEB(Multilingual)" -m openai/clip-vit-base-patch16

🧪 Run MIEB in Python

Similarly, running the benchmark can be done in Python in 3 main steps: 1) select the tasks, 2) load the model, and 3) run the evaluation.

  1. Select the whole benchmark:
import mteb

tasks = mteb.get_benchmarks("MIEB(Multilingual)")

Alternatively, select a single task:

tasks = mteb.get_tasks(tasks=["CIFAR10ZeroShot"])

Or select tasks by categories:

tasks = mteb.get_tasks(task_types=["Compositionality"])
  2. Load a Model:
model_name = "laion/CLIP-ViT-L-14-laion2B-s32B-b82K"
model = mteb.get_model(model_name=model_name)
  3. Run the Evaluation:
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model)
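
If you also want the per-task results written to disk, evaluation.run accepts an output folder; the short loop below prints a summary. The output_folder argument is standard mteb usage, but the attribute names on the result objects are indicative and may vary slightly across mteb versions.

# Persist per-task results and print a short summary afterwards.
results = evaluation.run(model, output_folder="results/mieb")
for res in results:
    print(res.task_name, res.scores)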

Extending MIEB with custom tasks and custom models

MIEB is designed to be extensible from day one. New instances of existing task categories and existing model types can easily be added, and adding an entirely new task category or model type only requires a few more steps. For more details, see the worked example in the documentation.
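
As a rough illustration of what a custom model can look like, the sketch below wraps a CLIP checkpoint and exposes text and image embedding methods. The method names get_text_embeddings and get_image_embeddings follow the pattern used by image models in mteb, but treat them as assumptions and check the documentation for the exact interface before registering a model.

import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor


class MyClipWrapper:
    """Illustrative wrapper; the exact interface expected by mteb may differ."""

    def __init__(self, name: str = "openai/clip-vit-base-patch16"):
        self.model = CLIPModel.from_pretrained(name)
        self.processor = CLIPProcessor.from_pretrained(name)

    def get_text_embeddings(self, texts, **kwargs) -> np.ndarray:
        # Tokenize and encode a batch of strings into text embeddings.
        inputs = self.processor(text=texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            return self.model.get_text_features(**inputs).cpu().numpy()

    def get_image_embeddings(self, images, **kwargs) -> np.ndarray:
        # Preprocess and encode a batch of PIL images into image embeddings.
        inputs = self.processor(images=images, return_tensors="pt")
        with torch.no_grad():
            return self.model.get_image_features(**inputs).cpu().numpy()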
