Massive Text Embedding Benchmark

non-profit

https://github.com/embeddings-benchmark

embeddings-benchmark

Activity Feed

AI & ML interests

Massive Text Embeddings Benchmark

Recent Activity

Muennighoff

updated a dataset about 8 hours ago

mteb/arena-results

Viewer • Updated about 8 hours ago • 4.63k • 1.15k • 4

Muennighoff

updated a Space 1 day ago

113

MTEB Arena

⚔

Display text-to-text translation interface

orionweller

updated a dataset 4 days ago

mteb/results

Updated 4 days ago • 97.5k • 1

AdnanElAssadi

updated a dataset 4 days ago

mteb/sib-fleurs-multilingual-mini

Viewer • Updated 4 days ago • 11.4k • 211

imenelydiaker

authored a paper 10 days ago

LineRetriever: Planning-Aware Observation Reduction for Web Agents

Paper • 2507.00210 • Published 11 days ago • 6

tomaarsen

posted an update 11 days ago

Post

2386

‼️Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, encode method improvements, a Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance! Details:

1️⃣ Sparse Encoder Models
Brand new support for sparse embedding models that generate high-dimensional embeddings (30,000+ dims) where <1% are non-zero:

- Full SPLADE, Inference-free SPLADE, and CSR architecture support
- 4 new modules, 12 new losses, 9 new evaluators
- Integration with @elastic-co , @opensearch-project , @NAVER LABS Europe, @qdrant , @IBM , etc.
- Decode interpretable embeddings to understand token importance
- Hybrid search integration to get the best of both worlds
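
To make the "mostly zeros" idea and the hybrid search point concrete, here is a minimal sketch in plain Python. It does not use Sentence Transformers itself. The dict-based vectors, the helper names (`sparse_dot`, `top_tokens`, `reciprocal_rank_fusion`), the toy weights, and the dense ranking are all illustrative assumptions, and reciprocal rank fusion is only one common way to combine sparse and dense rankings, not necessarily what any of the listed integrations use.

```python
from typing import Dict, List

# Assumption: a sparse embedding is stored as token -> weight;
# every token not present in the dict is implicitly zero.
SparseVec = Dict[str, float]


def sparse_dot(a: SparseVec, b: SparseVec) -> float:
    """Dot product that only visits the non-zero entries."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[t] for t, w in a.items() if t in b)


def top_tokens(vec: SparseVec, k: int = 3) -> List[str]:
    """Highest-weighted tokens, a rough stand-in for decoding an embedding."""
    ranked = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)
    return [token for token, _ in ranked[:k]]


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy data (made up for illustration).
query = {"sparse": 1.2, "embedding": 0.8}
docs = {
    "d1": {"sparse": 0.9, "vector": 0.4},
    "d2": {"dense": 1.1, "embedding": 0.5},
    "d3": {"sparse": 0.5, "embedding": 1.0, "model": 0.3},
}

sparse_ranking = sorted(docs, key=lambda d: sparse_dot(query, docs[d]), reverse=True)
dense_ranking = ["d2", "d3", "d1"]  # hypothetical output of a dense retriever
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])

print(sparse_ranking)
print(fused)
print(top_tokens(docs["d3"], k=2))
```

Because each vector only stores its non-zero tokens, the dot product cost scales with the number of active tokens rather than the full vocabulary size, which is what makes 30,000+ dimensional embeddings practical for search.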

2️⃣ Enhanced Encode Methods & Multi-Processing
- Introduces encode_query & encode_document, which automatically use predefined prompts
- No more manual pool management - just pass device list directly to encode()
- Much cleaner and easier to use than the old multi-process approach
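
As a rough illustration of what "automatically use predefined prompts" means, the toy class below mimics the idea without loading any model. `ToyPromptEncoder`, its prompt strings, and its string return values are hypothetical; only the method names `encode_query` and `encode_document` come from the post, and the real library returns embedding vectors rather than text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyPromptEncoder:
    """Hypothetical stand-in for an embedding model with per-role prompts."""

    # Assumed prompt strings; real models define their own.
    prompts: Dict[str, str] = field(
        default_factory=lambda: {"query": "query: ", "document": "document: "}
    )

    def encode(self, texts: List[str], prompt_name: str = "") -> List[str]:
        # A real model would tokenize and embed; this toy only shows
        # which text the model would see after the prompt is applied.
        prefix = self.prompts.get(prompt_name, "")
        return [prefix + text for text in texts]

    def encode_query(self, texts: List[str]) -> List[str]:
        # The caller no longer has to remember the query prompt.
        return self.encode(texts, prompt_name="query")

    def encode_document(self, texts: List[str]) -> List[str]:
        # Same for documents: the role picks the prompt.
        return self.encode(texts, prompt_name="document")


encoder = ToyPromptEncoder()
print(encoder.encode_query(["what is mteb?"]))
print(encoder.encode_document(["MTEB is an embedding benchmark."]))
```

The design point is that the role (query or document) is expressed by the method called, so the matching prompt cannot be forgotten or mixed up at the call site.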

3️⃣ Router Module & Advanced Training
- Router module with different processing paths for queries vs documents
- Custom learning rates for different parameter groups
- Composite loss logging - see individual loss components
- Perfect for two-tower architectures

4️⃣ Comprehensive Documentation & Training
- New Training Overview, Loss Overview, API Reference docs
- 6 new training example documentation pages
- Full integration examples with major search engines
- Extensive blogpost on training sparse models

Read the comprehensive blogpost about training sparse embedding models: https://huggingface.co/blog/train-sparse-encoder

See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/v5.0.0

What's next? We would love to hear from the community! What sparse encoder models would you like to see? And what new capabilities should Sentence Transformers handle - multimodal embeddings, late interaction models, or something else? Your feedback shapes our roadmap!

hongliu9903

authored 4 papers 11 days ago

Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Paper • 2305.14342 • Published May 23, 2023

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

Paper • 2402.12875 • Published Feb 20, 2024 • 13

Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models

Paper • 2210.14199 • Published Oct 25, 2022

MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings

Paper • 2506.23115 • Published 13 days ago • 36

Samoed

authored a paper 14 days ago

Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks

Paper • 2506.21182 • Published 16 days ago • 2

imenelydiaker

authored a paper 14 days ago

Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks

Paper • 2506.21182 • Published 16 days ago • 2

isaacchung

authored a paper 14 days ago

Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks

Paper • 2506.21182 • Published 16 days ago • 2

gowitheflow

authored a paper about 1 month ago

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Paper • 2506.07044 • Published Jun 8 • 108

dwzhu

authored a paper about 1 month ago

MiMo-VL Technical Report

Paper • 2506.03569 • Published Jun 4 • 74

aradhye

authored 2 papers about 1 month ago

The Third Monocular Depth Estimation Challenge

Paper • 2404.16831 • Published Apr 25, 2024

First Finish Search: Efficient Test-Time Scaling in Large Language Models

Paper • 2505.18149 • Published May 23 • 1

dwzhu

authored a paper about 2 months ago

MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

Paper • 2505.07608 • Published May 12 • 81

Muennighoff

authored a paper 2 months ago

Crosslingual Reasoning through Test-Time Scaling

Paper • 2505.05408 • Published May 8 • 8

gowitheflow

authored a paper 2 months ago

Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts

Paper • 2504.21117 • Published Apr 29 • 26

Team members 37
