Mitigating False Negatives in Multiple Negatives Ranking Loss for Retriever Training
Multiple Negatives Ranking Loss and In-Batch Negatives
When training sentence-transformers (bi-encoder models for text embeddings) with contrastive learning, a common objective is the Multiple Negatives Ranking (MNR) loss. In this setup, we have an anchor sentence and a positive sentence that form a true pair (e.g. a query and a relevant passage, or two paraphrases). The model is trained such that the anchor’s embedding is pulled close to its positive, while being pushed far from other sentences’ embeddings treated as negatives. Importantly, MNR loss makes heavy use of in-batch negatives: given a batch of (anchor, positive) pairs (or anchor-positive-negative triplets), any other non-matching sentence in the same batch is considered a negative example. In other words, for a given anchor, all the positives from the other pairs in the batch act as negatives (assuming they are unrelated). This in-batch negative sampling greatly boosts efficiency since we get many negatives for free without explicit mining.
Why in-batch negatives? The larger the batch, the more negative pairs we can derive, and typically the better the model learns. Intuitively, each additional example in the batch provides another “distractor” that the anchor should not match, forcing the model to sharpen its embedding distinctions. MNR loss essentially maximizes the similarity of the true (anchor, positive) pair while minimizing the similarity between each anchor and the positives of all other (anchor’, positive’) pairs in the batch. It can be seen as a form of InfoNCE contrastive loss across the batch.
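For intuition, here is a minimal sketch of this in-batch objective in PyTorch, assuming the anchor and positive embeddings have already been produced by the model (a conceptual illustration, not the library's implementation):

import torch
import torch.nn.functional as F

def mnr_loss(anchor_emb: torch.Tensor, positive_emb: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """In-batch MNR / InfoNCE loss over (batch_size, dim) embedding tensors.

    For anchor i, positive i is the true match; the positives of all other rows act as negatives.
    scale is the inverse temperature (20.0 matches the Sentence Transformers default for MNR loss).
    """
    anchor_emb = F.normalize(anchor_emb, dim=-1)
    positive_emb = F.normalize(positive_emb, dim=-1)
    scores = anchor_emb @ positive_emb.T * scale  # (batch_size, batch_size) scaled cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)  # correct "class" for row i is column i
    return F.cross_entropy(scores, labels)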
However, this approach assumes that any two non-matching sentences are truly negative. This naive assumption can fail – sometimes an “anchor” and some other “positive” in the batch might actually be semantically similar (e.g. two different questions about the same topic, or paraphrased sentences). In such cases, treating them as negatives is wrong; they are false negatives because they should not be pushed apart by the loss. Unfortunately, MNR as originally formulated will still treat them as negatives, which could hurt training. We will discuss how to address this issue.
Hard Negatives vs. In-Batch Negatives
Not all negative examples are equal. In practice we distinguish between hard negatives and the default in-batch negatives (sometimes called random negatives if not specifically mined):
- Hard negatives: These are negative samples selected because they are particularly challenging – for example, a non-relevant passage that is lexically or semantically similar to the query, or a sentence that differs from the anchor by only a small nuance. Hard negatives are often obtained via mining: using an existing retrieval model or cross-encoder to find top-ranked, but incorrect, results for a given query. Because they resemble the positives, hard negatives can teach the model to make fine-grained distinctions. Many state-of-the-art embedding model pipelines include a mining step to collect such difficult negatives (e.g. using a bi-encoder or cross-encoder teacher on a large corpus). The downside is that if the mining method is too “good,” some retrieved “negatives” might actually be true positives that are simply unlabeled – i.e. false negatives. For example, in the NV-Retriever paper, the authors cite prior work (Qu et al., 2020) showing that among the passages most similar to MS MARCO queries—commonly used as hard negatives—around 70% were actually relevant and should have been labeled as positives. This highlights a risk: using extremely similar (top-ranked) items as negatives can introduce a lot of noise if your labels are incomplete.
- In-batch negatives: As described, these are the other examples in the training batch which, by construction, are assumed to be unrelated to the anchor. They are “free” negatives that require no mining step. The challenge with in-batch negatives is that most will be easy negatives (truly unrelated sentences), especially if your batch is composed of random pairings. Easy negatives don’t teach as much as hard ones. The benefit, though, is volume: a batch of size B provides B–1 negatives per anchor. If B is large, the sheer variety of negatives can include some moderately hard ones by chance (and at least forces the model to discriminate many different sentences). If B is small, the negatives are few and you might miss out on learning from difficult contrasts. In sum, large batches make in-batch negatives far more effective, as noted in multiple works. In fact, the recent E5 embedding model paper explicitly showed that scaling the batch size from 1K to 32K yielded consistent gains on multiple evaluation sets. They also noted that if large batches are infeasible, one can compensate by adding some hard negatives to smaller batches to achieve similar benefits. Similarly, the BGE M3-Embedding model from BAAI emphasized optimizing the batching strategy to enable very large batches and high throughput, which improved the discriminativeness of the learned embeddings.
In summary, hard negatives provide targeted difficult comparisons but require careful mining (and risk false negatives if your mining is too aggressive), whereas in-batch negatives are easy to obtain and scalable, but require large batch sizes for maximum effect. Many modern training setups actually combine both: e.g. use a couple of mined hard negatives for each anchor, and still use the rest of the batch as additional negatives. This way, even a moderate batch size can include some strong negatives. If hard negatives are provided, you may not need quite as large a batch to reach good performance – indeed, using hard negatives can let you get away with smaller batches while still training an effective model. But if you rely only on random in-batch negatives, increasing batch size is crucial for strong performance.
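In the Sentence Transformers library this combination is straightforward: if the training dataset carries an explicit negative column (e.g. filled by a mining step), MultipleNegativesRankingLoss uses that hard negative for each anchor in addition to all the in-batch negatives. A minimal sketch with toy sentences (the column layout is what matters):

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("all-MiniLM-L6-v2")

# (anchor, positive, negative) rows: the "negative" column holds one mined hard negative per anchor
train_dataset = Dataset.from_dict({
    "anchor": ["How do I reset my password?", "What is the capital of France?"],
    "positive": ["Use the 'Forgot password' link on the login page.", "Paris is the capital of France."],
    "negative": ["Choose a password with at least 12 characters.", "Lyon is a large city in France."],
})

# Each anchor is contrasted against its own mined hard negative
# and against every other positive/negative in the batch.
loss = losses.MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()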
The False Negative Problem
While increasing batch size tends to improve contrastive learning, it exacerbates the issue mentioned earlier: false negatives. In very large batches drawn from a diverse dataset, it becomes increasingly likely that some other example in the batch is actually semantically related to a given anchor (even though it’s not the designated positive). For example, if your batch has 1024 sentence pairs, that’s 1023 potential “negatives” for each anchor – the odds that at least one of those is actually a true match (perhaps a duplicate question or a paraphrase) go up as the batch grows. Pushing those accidentally-related pairs apart is harmful to the model. Datasets such as MS MARCO or Natural Questions have many missing labels – there are often multiple relevant passages for a query but only one marked as the positive. All other relevant passages are, by default, treated as negatives in training, which injects noise into the contrastive objective. The authors of NV-Retriever highlight this, noting that standard hard-negative mining can introduce false negatives for exactly this reason: many top-ranked “negatives” turned out to be relevant upon closer inspection.
In-batch negatives have the same issue: by always treating any non-matching pair in the batch as a negative, we risk penalizing the model for assigning high similarity to what might actually be a related pair. We therefore need a mechanism to detect and avoid false negatives during training, especially as batches (or mined negative pools) grow large.
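To make the problem concrete, consider this tiny illustrative batch (the sentences are made up):

batch = [
    {"anchor": "How do I reset my password?",
     "positive": "Use the 'Forgot password' link on the login page."},
    {"anchor": "What is the capital of France?",
     "positive": "Paris is the capital of France."},
    {"anchor": "I forgot my password, what should I do?",
     "positive": "Go to Settings > Account > Reset password."},
]
# With in-batch negatives, the positive of pair 2 is treated as a negative for anchor 0,
# even though it is a perfectly valid answer to that anchor as well. Pushing them apart
# is exactly the kind of false-negative error described above.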
Removing False Negatives in Hard Negative Mining and In-Batch Negatives
When addressing false negatives in embedding model training, it is important to distinguish between the two stages where they can arise: during hard negative mining (in the data preparation phase) and among in-batch negatives (during model training). The key difference is that false negatives from hard negative mining are typically filtered before training, as part of the data sampling process, while false negatives among in-batch negatives are filtered dynamically during training using a guide model. Each stage therefore calls for its own mitigation strategy.
There are two primary approaches for addressing the issue of false negatives during the training of embedding models. The first approach involves performing positive-aware hard-negative mining for anchor-positive pairs by utilizing margin-based filtering criteria, as extensively explored in the NV-Retriever paper. The second approach focuses specifically on removing false negatives from in-batch negatives by employing a guide model.
The first method for mitigating false negatives is to apply positive-aware mining during hard negative mining by introducing a margin-based threshold in the negative selection process. This idea was recently explored in the NV-Retriever paper, which proposed using the positive’s relevance score as a reference point for deciding which negatives to filter out. Essentially, if a negative’s score is close to the positive’s score, it is removed from training. The NV-Retriever work explored two variants: an absolute score threshold (e.g. drop any negative with a relevance score above 0.7) and a positive-relative threshold (drop negatives that score at least, say, 95% of the positive’s score). They found that the positive-aware relative threshold performed best, since it adapts to each query’s specific positive score. A negative should be removed if it is almost on par with the positive (within a few percent), which is a strong indicator of a false negative. This “margin” method gives finer control: rather than ignoring all moderately high-scoring negatives, you can tune how close is “too close.” For instance, a 5% margin (retain negatives only if they score less than 95% of the positive’s similarity) turned out to be the optimal setting in NV-Retriever’s ablation study.
Here's an example of how to use the mine_hard_negatives function from the sentence-transformers library; it supports exactly this kind of margin-based filtering while mining hard negatives.
from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

# Load a Sentence Transformer model to score (anchor, candidate) similarities
model = SentenceTransformer("all-MiniLM-L6-v2")

# Load a dataset to mine hard negatives from
dataset = load_dataset("sentence-transformers/natural-questions", split="train")

dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    range_min=10,                # skip the 10 most similar candidates (most likely to be unlabeled positives)
    range_max=50,                # only consider candidates ranked up to position 50
    max_score=0.8,               # drop candidates with an absolute similarity above 0.8
    relative_margin=0.05,        # drop candidates scoring >= 95% of the positive's similarity
    num_negatives=5,             # keep 5 negatives per anchor
    sampling_strategy="random",  # sample randomly among the remaining candidates
    batch_size=128,
    use_faiss=True,              # use FAISS for fast nearest-neighbor search
)
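Under the hood, the positive-aware criterion expressed by relative_margin boils down to a comparison against the positive’s score. A rough standalone sketch of the idea (the function and variable names here are illustrative, not part of any library):

def filter_false_negatives(pos_score, candidates, relative_margin=0.05):
    """Keep only candidates scoring below (1 - relative_margin) * pos_score.

    candidates: list of (passage, similarity-to-query) pairs from a retriever.
    With relative_margin=0.05, anything scoring >= 95% of the positive's
    similarity is considered a likely false negative and discarded.
    """
    threshold = (1.0 - relative_margin) * pos_score
    return [(text, score) for text, score in candidates if score < threshold]

# Example: the positive scores 0.80 -> threshold 0.76, so the 0.78 candidate is dropped
negatives = filter_false_negatives(
    pos_score=0.80,
    candidates=[("candidate A", 0.78), ("candidate B", 0.70), ("candidate C", 0.55)],
)
# negatives == [("candidate B", 0.70), ("candidate C", 0.55)]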
The second method, which focuses specifically on removing false negatives from in-batch negatives by employing a guide (teacher) model, has been implemented in the Sentence Transformers library as GISTEmbedLoss (Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning). GISTEmbedLoss extends the standard MultipleNegativesRankingLoss by incorporating a guide model to assist in selecting in-batch negatives. During training, for each anchor and each potential negative in the batch, the guide model computes a similarity score. If the guide determines that a negative is more similar to the anchor than its corresponding positive, that pair is masked out and excluded from the loss calculation. By filtering out these likely false negatives, GISTEmbedLoss provides a cleaner and more reliable training signal, resulting in improved model stability and embedding robustness. And as one might expect, higher quality training data leads to a better model. In fact, using GISTEmbedLoss in place of the standard loss has been shown to improve performance, as demonstrated both in the original paper and in our own experiments.
Below is an example of how to train an embedding model using GISTEmbedLoss from the sentence-transformers library, which supports margin-based filtering to exclude false negatives during training.
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
guide = SentenceTransformer("all-MiniLM-L6-v2")

train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})

loss = losses.GISTEmbedLoss(
    model,
    guide,
    margin_strategy="absolute",  # or "relative" (e.g., margin=0.05 for max. 95% of positive similarity)
    margin=0.1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
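For intuition, the guide-based filtering that this loss performs can be sketched roughly as a masking step over the similarity matrix. This is a conceptual sketch (the tensor names are illustrative), not the library’s actual implementation:

import torch

def mask_false_negatives(guide_ap, guide_an, scores_an, margin=0.0):
    """Mask in-batch candidates that the guide model flags as likely false negatives.

    guide_ap:  (batch,) guide similarities of each anchor to its own positive.
    guide_an:  (batch, batch) guide similarities of each anchor to every in-batch candidate.
    scores_an: (batch, batch) similarities from the model being trained.
    A candidate is masked out when the guide scores it at or above the positive's score minus the margin.
    """
    false_negative = guide_an >= (guide_ap.unsqueeze(1) - margin)
    false_negative.fill_diagonal_(False)  # never mask the true positive of each anchor
    return scores_an.masked_fill(false_negative, float("-inf"))  # excluded from the softmax/loss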
Both approaches—removing false negatives in hard negative mining through margin-based filtering, and in in-batch negatives using a guide model—offer complementary strategies to effectively reduce false negatives, resulting in cleaner training signals and better embedding model performance.
CachedMultipleNegativesRankingLoss: Enabling Large Batches
Before diving deeper into the guided loss and margins, it’s worth mentioning the innovation that made very large batches feasible in practice: CachedMultipleNegativesRankingLoss. This is a variant of MNR loss introduced in Sentence Transformers that uses a two-step computation (embedding caching and then loss computation) so that one can effectively use extremely large virtual batch sizes without running out of GPU memory. The idea (inspired by techniques like gradient checkpointing/caching and cross-device negative pooling) is to first compute embeddings for a large set of examples (aggregating multiple mini-batches), and then compute the contrastive loss over that larger set by reusing the stored embeddings rather than keeping the entire batch in memory at once. This allows, for example, simulating a batch of thousands of examples even if your GPU could only fit, say, 128 at a time. CachedMultipleNegativesRankingLoss enabled researchers to scale batch size much further (e.g. to thousands) and thereby improve model performance by leveraging more in-batch negatives. In fact, the original paper on Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup cited by the Sentence Transformers docs showed that such negative caches can dramatically boost retrieval performance without additional memory cost.
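Usage is nearly identical to the standard MNR loss. In the sketch below (the toy dataset and output_dir are placeholders), mini_batch_size only controls how many examples are embedded per forward pass, while the number of in-batch negatives is determined by the training batch size:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")

train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})

# Only mini_batch_size embeddings are held in GPU memory at a time,
# so the contrastive batch can be far larger than what fits in a single forward pass.
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=64)

args = SentenceTransformerTrainingArguments(output_dir="output", per_device_train_batch_size=2048)
trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()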
Building upon this, CachedGISTEmbedLoss combines the memory efficiency of Cached MNR with the false negative filtering capability of GISTEmbedLoss. It brings together the best of both worlds: the ability to simulate massive batch sizes and the robustness of using a guide model to identify and filter out misleading negatives. The caching mechanism ensures high-throughput training with stable memory usage, while the guided component actively removes false negatives that could otherwise destabilize training. The result is a more scalable, stable, and effective contrastive learning objective. CachedGISTEmbedLoss enables practitioners to harness the full power of large batch training—rich in negatives—without compromising on signal quality.
Here's an example of using CachedGISTEmbedLoss, a loss function provided by the sentence-transformers library that enables large batch training by caching embeddings and computing the loss separately.
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
guide = SentenceTransformer("all-MiniLM-L6-v2")

train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})

loss = losses.CachedGISTEmbedLoss(
    model,
    guide,
    mini_batch_size=64,          # chunk size for the cached embedding pass; the training batch can be much larger
    margin_strategy="absolute",  # or "relative" (e.g., margin=0.05 for max. 95% of positive similarity)
    margin=0.1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
Margin-Enhanced GISTEmbedLoss (Absolute vs. Relative Margins)
CachedGISTEmbedLoss adopts the positive-aware strategy proposed in the NV-Retriever paper by incorporating a configurable margin strategy for filtering negatives, giving users fine-grained control over how aggressively to filter out potential false negatives. You can choose between two modes:
- Absolute margin: A fixed margin value m is used. The guide model computes the similarity score for the anchor-positive pair (call it S_pos) and for each anchor-negative pair (S_neg). Any negative that has
S_neg ≥ S_pos – m
is discarded. Essentially, if a negative’s score is within m of the positive’s score, we consider it too close to be a true negative. For example, if m = 0.1 and the positive similarity is 0.8, then any negative scoring ≥ 0.7 with the anchor would be filtered out.
- Relative margin: A percentage-based criterion. Instead of an absolute difference, we use a fraction of the positive’s score. A negative is filtered if
S_neg ≥ S_pos * (1 - r)
for some ratio r. For instance, with r = 0.05 (i.e. 5%), if the positive score is 0.8, we drop negatives scoring ≥ 0.8 * 0.95 = 0.76. This scales with the difficulty of the positive: if S_pos is lower, the threshold for negatives is also lower in absolute terms. The relative strategy means “don’t allow negatives that come within X% of the positive’s score.” Both criteria are illustrated in the short sketch after this list.
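Here is a small sketch of the two criteria, using the numbers from the examples above (an illustrative helper, not library code):

def is_false_negative(s_pos, s_neg, strategy="absolute", margin=0.1):
    """Return True if the negative should be filtered out under the given margin strategy."""
    if strategy == "absolute":
        return s_neg >= s_pos - margin          # e.g. S_pos=0.8, m=0.1  -> filter any S_neg >= 0.70
    if strategy == "relative":
        return s_neg >= s_pos * (1.0 - margin)  # e.g. S_pos=0.8, r=0.05 -> filter any S_neg >= 0.76
    raise ValueError(f"Unknown strategy: {strategy}")

print(is_false_negative(0.8, 0.72, "absolute", 0.1))   # True  (0.72 >= 0.70)
print(is_false_negative(0.8, 0.72, "relative", 0.05))  # False (0.72 <  0.76)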
These margin-based filters help catch borderline cases. A small margin (absolute or relative) means only negatives almost as good as the positive are removed; a larger margin would filter out even moderately similar negatives. The margin can thus be tuned based on how strict or lenient you want to be in labeling something a “false negative.” In practice, the relative margin has been found to work extremely well, as it adapts to each example’s context. The NV-Retriever experiments support this: their best-performing method, called TopK-PercPos, effectively used a relative margin of 5% (they kept negatives that had <95% of the positive’s relevance score). This dynamic threshold outperformed a static cutoff in their ablation study. CachedGISTEmbedLoss allows you to easily try both strategies (margin_strategy="absolute" or "relative") and set the margin value appropriate for your task.
By adjusting the margin, one can control the precision-recall trade-off in negative filtering. A zero margin essentially replicates the original GISTEmbedLoss behavior (only drop negatives that are at least as similar as the positive). A slight margin (like 5-10%) will catch those cases where a negative is almost as similar, even if not exceeding the positive’s score. Increasing the margin further would make the filtering more aggressive (potentially filtering some legitimate negatives that happen to have moderately high similarity).
When to Use CachedGISTEmbedLoss
CachedGISTEmbedLoss shines when you have abundant data and some potential noise/overlap in it, and you want to squeeze out the maximum embedding performance:
- If you can afford longer training time and want the absolute best model, using a huge batch (with caching) often yields improvements that smaller batches can't match. For example, training on millions of pairs for semantic search, you might use an effective batch size of 16k or more to maximize recall.
- If your dataset is noisy or has many similar pairs, the guide model will protect against learning from false negatives. This often translates to better generalization.
- On the flip side, if your training data is very clean (no chance of false negatives) and you’re limited in time, you might opt for CachedMultipleNegativesRankingLoss (without a guide) for simplicity, or even standard MNRL if batch sizes aren’t an issue. GIST’s benefits come at the cost of extra computations (running a guide model) and slightly slower training.
Experimental Results
Contrastive losses that rely on in-batch negatives can suffer from false negatives—examples that are actually semantically related but treated as negatives. To address this, we introduce a flexible, margin-based filtering mechanism in our custom CachedGISTEmbedLoss. By discarding any negative whose similarity to the anchor comes too close to (or above) the positive’s similarity—either by a fixed amount (absolute margin) or by a fraction (relative margin)—we ensure the model focuses on genuinely hard negatives.
Below we summarize two sets of experiments:
- English Data Experiment: Trained the mpnet-base model on the AllNLI (sentence-transformers/all-nli) dataset.
- Korean Data Experiment: Fine-tuned intfloat/multilingual-e5-small on a Korean query–passage pairs dataset.
Experiment 1: AllNLI, mpnet-base
We fine-tuned Microsoft’s mpnet-base model on the AllNLI (sentence-transformers/all-nli) dataset from Hugging Face, using the first 100,000 anchor–positive–negative examples for training and the standard dev/test splits for evaluation. As a guide model, we leveraged all-mpnet-base-v2 to compute guidance similarities in our CachedGISTEmbedLoss.
For each training run, we set:
Loss margin strategy
- Absolute (e.g. margin=0.1): drop negatives with sim(neg, anchor) ≥ sim(pos, anchor) – 0.1
- Relative (e.g. margin=0.05): drop negatives with sim(neg, anchor) ≥ sim(pos, anchor) × 0.95
Hyperparameters
- Setting 1: batch size 512, lr=2 × 10⁻⁵, 1 epoch, warmup 0.1, FP16
- Setting 2: batch size 2048, lr=4 × 10⁻⁵, 1 epoch, warmup 0.1, FP16
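For instance, Setting 1 maps onto SentenceTransformerTrainingArguments roughly as follows (a sketch: the output directory is illustrative and all other arguments are left at their defaults):

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output/mpnet-base-allnli",  # illustrative path
    per_device_train_batch_size=512,
    learning_rate=2e-5,
    num_train_epochs=1,
    warmup_ratio=0.1,
    fp16=True,
)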


- With bs=512, an absolute margin of 0.2 yielded the highest dev accuracy (~0.889), outperforming standard MNR by ~0.017 points; using no margin, or standard MNR, resulted in relatively lower performance.
- With bs=2048, a relative margin of 0.15 achieved the best early convergence and the top accuracy (~0.857), ~0.014 points above MNR; again, using no margin, or standard MNR, resulted in relatively lower performance.
Experiment 2: Korean Query–Passage Retrieval
We then applied the CachedGISTEmbedLoss to a Korean retrieval task, using a dataset of Korean query–passage pairs for training. We fine-tuned the intfloat/multilingual-e5-small model and used the same model as the guide to compute similarity scores during training. Evaluation was conducted on the MTEB-ko-retrieval dataset (GitHub).
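For reference, NDCG@10 on such a retrieval set can be computed with the library's InformationRetrievalEvaluator. The tiny query/corpus/qrels dictionaries below are placeholders rather than the actual MTEB-ko-retrieval data (note the "query: "/"passage: " prefixes expected by E5-style models):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("intfloat/multilingual-e5-small")

# Placeholder Korean data; in practice these come from the MTEB-ko-retrieval benchmark
queries = {"q1": "query: 한국의 수도는 어디인가요?"}  # "What is the capital of Korea?"
corpus = {
    "d1": "passage: 대한민국의 수도는 서울이다.",      # "The capital of South Korea is Seoul."
    "d2": "passage: 부산은 대한민국의 항구 도시이다.",  # "Busan is a port city in South Korea."
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, ndcg_at_k=[10])
metrics = evaluator(model)  # returns the retrieval metrics, including NDCG@10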
Hyperparameters
- Setting: batch size 20,000, lr=2.5 × 10⁻⁴, 2 epochs, warmup 0.05, FP16
Baselines
- Base Model (no training): NDCG@10 = 0.671
- MNR (MultipleNegativesRankingLoss): 0.626 (–0.045 vs base)
- GISTEmbedLoss (no margin): 0.678 (+0.007 vs base)
Relative Margins (GISTEmbedLoss)
- Margin 0.85 → 0.683 (+0.012 vs base)
- Margin 0.90 → 0.684 (+0.013 vs base)
- Margin 0.95 → 0.682 (+0.011 vs base)
Absolute Margins (GISTEmbedLoss)
- Margin 0.85 → 0.678 (+0.007 vs base)
- Margin 0.90 → 0.686 (+0.015 vs base)
- Margin 0.95 → 0.686 (+0.015 vs base)

The best settings (absolute 0.90/0.95) delivered up to +0.015 NDCG@10 over the untrained base model (and +0.060 over the MNR run), nearly doubling the no-margin gain. These results confirm that margin-based filtering, inspired by NV-Retriever’s false-negative removal, is highly effective.
Experimental Conclusion
Across both the English (AllNLI) and Korean (query–passage retrieval) experiments, GISTEmbedLoss consistently outperforms MNR, demonstrating strong robustness across languages and domains. By tuning just a single scalar parameter—the margin—practitioners can achieve significant improvements in contrastive training performance.
Interestingly, in the Korean experiment, training with MNR Loss alone resulted in performance lower than the base model, suggesting that failing to remove false negatives from in-batch negatives can significantly degrade retrieval quality. This underscores the importance of effective false negative mitigation in contrastive learning pipelines.
Notably, the application of margins further amplifies the advantage of GISTEmbedLoss, with the optimal margin value varying by language and guide model. In the English experiment using the mpnet-base model, an absolute margin of 0.2 yielded the highest performance. In contrast, for the Korean query–passage experiment, the best results were obtained with margins between 0.05 and 0.1.
This difference in optimal margin settings aligns with the temperature values of the guide models used during training. Temperature, which inversely relates to similarity sharpness, affects how widely or narrowly similarity scores are distributed. A lower temperature leads to more concentrated similarity scores, which favors smaller margins, while a higher temperature yields flatter distributions that benefit from larger margins. The English guide model, all-mpnet-base-v2, was trained with a temperature of 0.05, supporting the use of a larger margin. Meanwhile, among the multilingual guide models tested for Korean—intfloat/multilingual-e5-small, BAAI/BGE-m3, and snowflake-arctic-embed-l-v2.0—the snowflake-arctic-embed-l-v2.0 model achieved the best results. This arctic-embed model was trained with a temperature of 0.02, and its resulting concentrated similarity distribution favored smaller margin settings during filtering.
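The role of temperature in the contrastive softmax can be illustrated numerically: dividing the same cosine similarities by a smaller temperature concentrates the probability mass much more sharply. A toy illustration (the similarity values are made up and unrelated to any specific model):

import torch

sims = torch.tensor([0.80, 0.75, 0.60])  # cosine similarities: positive vs. two negatives

for temperature in (0.05, 0.02):
    probs = torch.softmax(sims / temperature, dim=0)
    print(temperature, [round(p, 3) for p in probs.tolist()])

# 0.05 -> roughly [0.72, 0.27, 0.01]  (softer distribution)
# 0.02 -> roughly [0.92, 0.08, 0.00]  (much sharper distribution)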
A more detailed discussion of this temperature-margin interaction can be found in the related GitHub issue.
Conclusion
In contrastive learning for retriever models, eliminating false negatives is crucial not only for improving performance but also for ensuring that the learned embeddings reflect true semantic relationships. False negatives—samples that are semantically related to the anchor but treated as negatives—can mislead the model into separating concepts that should be close, ultimately harming embedding quality and generalization.
This article explored two complementary strategies for mitigating false negatives:
- Filtering false negatives during hard negative mining using margin-based thresholds, as proposed in the NV-Retriever paper. By adopting a positive-aware filtering mechanism, the model discards negatives that are too close in similarity to the positive, thereby preventing misleading supervision. Both absolute and relative margin strategies offer fine-grained control over this filtering process.
- Removing false negatives among in-batch negatives using a guide model, as implemented in GISTEmbedLoss. The guide model estimates similarity scores and excludes any in-batch negatives that appear more similar to the anchor than the designated positive. This enhances the reliability of the loss signal and prevents penalizing semantically correct alignments.
Based on the experimental results, a model trained with GISTEmbedLoss (using an appropriate guide and margin) typically outperforms one trained with MultipleNegativesRankingLoss under otherwise identical conditions. This is because it learns from a richer set of negatives via large batches while avoiding misleading supervision from false negatives through the use of guided margins. The result is not only improved benchmark performance but also greater training stability. These findings highlight the importance of actively identifying and filtering out false negatives—particularly in retrieval scenarios with incomplete labeling—as a means to enhance contrastive learning. By providing cleaner supervision in a scalable manner, CachedGISTEmbedLoss enables the development of more semantically accurate embedding models, making it a valuable option for training high-performing sentence transformers across various domains.
📎 References
Sentence Transformers Training Overview https://sbert.net/docs/sentence_transformer/training_overview.html
Sentence Transformers Losses – MultipleNegativesRankingLoss, GISTEmbedLoss explanations and examples https://www.sbert.net/docs/package_reference/losses.html
GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning https://arxiv.org/abs/2402.16829
Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models https://arxiv.org/abs/2405.05374
NV-Retriever: Improving Text Embedding Models with Effective Hard-Negative Mining https://arxiv.org/abs/2407.15831
Text Embeddings by Weakly-Supervised Contrastive Pretraining (E5) https://arxiv.org/abs/2212.03533
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation https://arxiv.org/abs/2402.03216
GradCache: Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup https://arxiv.org/pdf/2101.06983
MTEB-ko-retrieval https://github.com/nlpai-lab/KURE
Git Pull Request: Adding Margin to CachedGISTEmbedLoss https://github.com/UKPLab/sentence-transformers/pull/3299
Experiment Result: Korean Query–Passage Retrieval Model https://huggingface.co/dragonkue/multilingual-e5-small-ko