
Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval

Published on May 22 · Submitted by nthakur on May 23
Authors: Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin

Abstract

Using cascading LLM prompts to identify and relabel false negatives in datasets improves retrieval and reranking models' performance.

AI-generated summary

Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs drawn from various data sources. However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by a factor of 2.35 and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives as true positives improves both the E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where judgments by GPT-4o show much higher agreement with humans than those by GPT-4o-mini.

Community

Relabeling datasets for Information Retrieval improves nDCG@10 of both embedding models & cross-encoder rerankers. This was already the prevalent belief, but now it's been confirmed. Great job @nthakur, @crystina-z, @MrLight & @lintool

See the organization with datasets & models here: https://huggingface.co/rlhn

  • Tom Aarsen
Paper submitter

Did you know that fine-tuning retrievers & re-rankers on large but unclean training datasets can harm their performance? 😡

In our new preprint, we reexamine the quality of popular IR training data by pruning datasets and identifying and relabeling false negatives!

Preprint: https://arxiv.org/abs/2505.16967

🌟 Preliminary
We fine-tune E5 (base) on 16 retrieval datasets from the BGE collection (1.6M training pairs) and conduct a leave-one-out analysis: leaving one dataset out and fine-tuning on the rest. Surprisingly, removing ELI5 alone improves nDCG@10 on 7 of 14 BEIR datasets! 🤯
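In pseudocode, the protocol looks roughly like the sketch below; train_and_eval is a hypothetical placeholder for the actual fine-tuning and BEIR evaluation pipeline (see the castorini/rlhn repo for the real code), and the dataset names are illustrative.

```python
# Hypothetical sketch of the leave-one-out ablation.
DATASETS = [  # illustrative names, not the exact BGE list
    "msmarco", "hotpotqa", "nq", "fever", "eli5", "quora", "squad", "triviaqa",
]

def train_and_eval(datasets: list[str]) -> float:
    """Fine-tune E5 (base) on `datasets` and return the average nDCG@10 on BEIR."""
    raise NotImplementedError  # placeholder for the real training/evaluation run

baseline = train_and_eval(DATASETS)        # fine-tune on everything once
for held_out in DATASETS:                  # then drop one dataset at a time
    score = train_and_eval([d for d in DATASETS if d != held_out])
    print(f"without {held_out}: {score - baseline:+.2f} nDCG@10 vs. full training")
```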

🚀 Dataset Pruning
1️⃣ We prune 8 of the 15 training datasets, leaving 7 and reducing the training pairs by 2.35x (1.6M -> 680K pairs).
2️⃣ E5 (base) fine-tuned on these 7 datasets outperforms the model fine-tuned on all 15 datasets by 1.0 nDCG@10 on BEIR.
3️⃣ This shows that some datasets actively hurt model performance.

📊 False Negatives
Even after pruning, the remaining training datasets share a common issue: "false negatives", passages that are actually relevant but incorrectly labeled as hard negatives. We propose a cascading LLM-judge framework (RLHN) to identify and relabel these false negatives in training datasets.
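The exact prompts and cascade configuration are in the paper; as a rough illustration, the two-stage idea might look like the sketch below. The prompt text, relevance criterion, and escalation rule here are simplified assumptions, not the paper's actual setup.

```python
# Minimal sketch of a two-stage LLM-judge cascade (illustrative only).
# Requires OPENAI_API_KEY to be set in the environment.
from openai import OpenAI

client = OpenAI()

PROMPT = (  # simplified stand-in for the paper's prompts
    "Query: {query}\n\nPassage: {passage}\n\n"
    "Does the passage answer the query? Answer with exactly 'relevant' or 'irrelevant'."
)

def judge(query: str, passage: str, model: str) -> bool:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": PROMPT.format(query=query, passage=passage)}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("relevant")

def is_false_negative(query: str, hard_negative: str) -> bool:
    # Stage 1: cheap screening pass; most hard negatives are true negatives,
    # so the inexpensive model filters out the bulk of them.
    if not judge(query, hard_negative, model="gpt-4o-mini"):
        return False
    # Stage 2: only flagged candidates are escalated to the stronger model,
    # whose judgments agree much more closely with human annotators.
    return judge(query, hard_negative, model="gpt-4o")
```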

We carefully measure three ways of handling the identified false negatives in training pairs (see the sketch after this list):
1️⃣ Remove: discard the entire training pair that contains a false negative.
2️⃣ HN Remove: discard only the false negatives from the list of hard negatives.
3️⃣ RLHN: relabel the false negatives as positives, while keeping the remaining hard negatives.

📊 Experimental Results
RLHN yields the largest improvements for both retrievers and rerankers compared to the other approaches. RLHN shows consistent gains even when we relabel only a small subset of the training pairs: the out-of-domain nDCG@10 on BEIR (avg. over 7 datasets) and AIR-Bench (avg. over 5) both improve steadily as more data is cleaned.

We also qualitatively analyze the different categories of identified false negatives; for example, a query can be ambiguous, in which case many of its hard negatives are actually relevant to it.

Paper: https://arxiv.org/abs/2505.16967
Code: https://github.com/castorini/rlhn
Data: https://huggingface.co/rlhn

