antoinelouis/splade-max-camembert-base-mmarcoFR

+---
+pipeline_tag: sentence-similarity
+language: fr
+license: mit
+datasets:
+- unicamp-dl/mmarco
+metrics:
+- recall
+tags:
+- passage-retrieval
+library_name: transformers
+base_model: almanach/camembert-base
+model-index:
+- name: spladev2-camembert-base-mmarcoFR
+  results:
+  - task:
+      type: sentence-similarity
+      name: Passage Retrieval
+    dataset:
+      type: unicamp-dl/mmarco
+      name: mMARCO-fr
+      config: french
+      split: validation
+    metrics:
+    - type: recall_at_1000
+      name: Recall@1000
+      value: 89.86
+    - type: recall_at_500
+      name: Recall@500
+      value: 85.96
+    - type: recall_at_100
+      name: Recall@100
+      value: 73.94
+    - type: recall_at_10
+      name: Recall@10
+      value: 46.33
+    - type: map_at_10
+      name: MAP@10
+      value: 24.15
+    - type: ndcg_at_10
+      name: nDCG@10
+      value: 29.58
+    - type: mrr_at_10
+      name: MRR@10
+      value: 24.68
+---
+# spladev2-camembert-base-mmarcoFR
+This is a [SPLADE-max](https://doi.org/10.48550/arXiv.2109.10086) model for **French** that can be used for semantic search. The model maps queries and passages to
+32k-dimensional sparse vectors, which are compared with cosine similarity to compute relevance scores.
+## Usage
+Start by installing the [Transformers](https://huggingface.co/docs/transformers) library: `pip install -U transformers` (the example below also requires PyTorch). Then, you can use the model as follows:
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
+passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]
+# Load the model with its masked-LM head: the sparse activations are built from the vocabulary logits.
+tokenizer = AutoTokenizer.from_pretrained('antoinelouis/spladev2-camembert-base-mmarcoFR')
+model = AutoModelForMaskedLM.from_pretrained('antoinelouis/spladev2-camembert-base-mmarcoFR')
+model.eval()
+q_input = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
+p_input = tokenizer(passages, padding=True, truncation=True, return_tensors='pt')
+with torch.no_grad():
+    q_output = model(**q_input)
+    p_output = model(**p_input)
+# SPLADE-max pooling: zero out padding positions, apply log(1 + ReLU(logits)), then take the max over tokens.
+q_activations = torch.amax(torch.log1p(torch.relu(q_output.logits * q_input['attention_mask'].unsqueeze(-1))), dim=1)
+p_activations = torch.amax(torch.log1p(torch.relu(p_output.logits * p_input['attention_mask'].unsqueeze(-1))), dim=1)
+# L2-normalize so that the dot product below equals the cosine similarity.
+q_activations = torch.nn.functional.normalize(q_activations, p=2, dim=1)
+p_activations = torch.nn.functional.normalize(p_activations, p=2, dim=1)
+# similarity[i, j] is the relevance score of passage j for query i.
+similarity = q_activations @ p_activations.T
+print(similarity)
+```
+## Evaluation
+The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of
+8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
+To see how it compares to other neural retrievers in French, check out the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
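As an illustration of two of the reported metrics, here is a minimal pure-Python sketch of MRR@k and Recall@k. The function names, the `runs` and `qrels` dictionaries, and the toy passage ids are hypothetical and are not taken from the evaluation code behind this model card.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant passage in the top k, or 0.0 if none."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Return the fraction of relevant passages retrieved in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_over_queries(metric, runs, qrels, **kwargs):
    """Average a per-query metric over all queries that have relevance labels."""
    return sum(metric(runs[q], qrels[q], **kwargs) for q in qrels) / len(qrels)


# Toy data (hypothetical): ranked passage ids per query and their relevant ids.
runs = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p4", "p2"]}
qrels = {"q1": {"p1"}, "q2": {"p2", "p5"}}

mrr_at_10 = mean_over_queries(reciprocal_rank_at_k, runs, qrels, k=10)
recall_at_2 = mean_over_queries(recall_at_k, runs, qrels, k=2)
print(f"MRR@10 = {mrr_at_10:.4f}, R@2 = {recall_at_2:.4f}")
```

Scores computed this way lie in [0, 1]; the values in the model card metadata are expressed as percentages.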
+## Training
+#### Data
+The model is trained on the French training samples of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO that
+contains 8.8M passages and 539K training queries. We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset)
+with BM25 negatives.
+#### Implementation
+The model is initialized from the [almanach/camembert-base](https://huggingface.co/almanach/camembert-base) checkpoint and optimized with a combination of the InfoNCE
+ranking loss (temperature of 0.05) and the FLOPS regularization loss. The regularization weight lambda increases quadratically until step 33k, after which it remains
+constant at lambda_q = 3e-4 for queries and lambda_d = 1e-4 for passages. The model is fine-tuned on one 80GB NVIDIA H100 GPU for 100k steps using the AdamW optimizer
+with a batch size of 128 and a peak learning rate of 2e-5, with warm-up over the first 4,000 steps and linear scheduling. The maximum sequence lengths for queries and
+passages are fixed to 32 and 128 tokens, respectively. Relevance scores are computed with cosine similarity.
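The pieces of the training objective described above can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions, not the training code used for this model: the function names are hypothetical, the schedule assumes the form lambda_t = lambda * (t / T)^2 capped at lambda, and the FLOPS term follows the usual definition (sum over vocabulary terms of the squared mean activation in the batch).

```python
import math


def lambda_at_step(step, lambda_max, ramp_steps=33_000):
    """Quadratic ramp-up of the regularization weight, constant after ramp_steps (assumed form)."""
    ratio = min(1.0, step / ramp_steps)
    return lambda_max * ratio ** 2


def flops_loss(activations):
    """FLOPS regularizer: sum over vocabulary terms of the squared batch-mean activation."""
    batch_size = len(activations)
    vocab_size = len(activations[0])
    total = 0.0
    for j in range(vocab_size):
        mean_j = sum(row[j] for row in activations) / batch_size
        total += mean_j ** 2
    return total


def info_nce(pos_score, neg_scores, temperature=0.05):
    """InfoNCE loss for one query: negative log-softmax of the positive score."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


def total_loss(pos_score, neg_scores, q_acts, p_acts, step):
    """Ranking loss plus scheduled FLOPS penalties on query and passage activations."""
    lam_q = lambda_at_step(step, 3e-4)  # lambda_q from the text above
    lam_d = lambda_at_step(step, 1e-4)  # lambda_d from the text above
    return info_nce(pos_score, neg_scores) + lam_q * flops_loss(q_acts) + lam_d * flops_loss(p_acts)


# Toy batch (hypothetical values): 2 examples over a 2-term vocabulary.
toy_acts = [[1.0, 0.0], [3.0, 2.0]]
print(lambda_at_step(16_500, 3e-4))
print(flops_loss(toy_acts))
print(total_loss(0.9, [0.2, 0.1], toy_acts, toy_acts, step=50_000))
```

Because the FLOPS term squares the batch-mean activation of each vocabulary term, it penalizes terms that fire across many examples more than terms that fire rarely, which encourages sparse and evenly spread representations.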
+## Citation
+```bibtex
+@online{louis2024decouvrir,
+	author    = {Antoine Louis},
+	title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
+	publisher = {Hugging Face},
+	month     = {mar},
+	year      = {2024},
+	url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
+}
+```