Spaces-explorers

Activity Feed Request to join this org

AI & ML interests

Contributors who are invited to beta-test our next big feature! Contact us if you want to join this team :-)

Recent Activity

sadrasabouri authored a paper about 1 month ago

naab: A ready-to-use plug-and-play corpus for Farsi

muhammadravi251001 authored a paper about 1 month ago

Language Surgery in Multilingual Large Language Models

muhammadravi251001 authored a paper 5 months ago

Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

View all activity

machineteacher

authored a paper 3 months ago

Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models

Paper • 2504.20157 • Published Apr 28 • 38

mohsenfayyaz

authored a paper 5 months ago

Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence

Paper • 2503.05037 • Published Mar 6 • 4

fuwei

authored a paper 5 months ago

TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation

Paper • 2502.07870 • Published Feb 11 • 45

jxm

posted an update 6 months ago

Post

744

New state-of-the-art BERT-size retrieval model: *cde-small-v2* 🥳🍾

Hi everyone! We at Cornell are releasing a new retrieval model this week. It uses the contextual embeddings framework, is based on ModernBERT backbone, and gets state-of-the-art results on the MTEB benchmark for its model size (140M parameters). cde-small-v2 gets an average score of 65.6 across the 56 datasets and sees improvements from our previous model in *every* task domain (retrieval, classification, etc.).

We made a lot of changes to make this model work. First of all, ModernBERT has a better tokenizer, which probably helped this work out-of-the-box. We also followed the principles from the CDE paper and used harder clusters and better hard-negative filtering, which showed a small performance improvement. And we made a few small changes that have been shown to work on the larger models: we disabled weight decay, masked out the prefix tokens during pooling, and added a residual connection from the first-stage to the second-stage for better gradient flow.

We're still looking for a computer sponsor to help us scale CDE to larger models. Since it's now state-of-the-art at the 100M parameter scale, it seems to be a reasonable bet that we could train a state-of-the-art large model if we had the GPUs. If you're interested in helping with this, please reach out!

Here's a link to the model: jxm/cde-small-v2
And here's a link to the paper: Contextual Document Embeddings (2410.02525)