Create README.md
README.md
ADDED
@@ -0,0 +1,117 @@
---
pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- passage-retrieval
library_name: transformers
base_model: almanach/camembert-base
model-index:
- name: spladev2-camembert-base-mmarcoFR
  results:
  - task:
      type: sentence-similarity
      name: Passage Retrieval
    dataset:
      type: unicamp-dl/mmarco
      name: mMARCO-fr
      config: french
      split: validation
    metrics:
    - type: recall_at_1000
      name: Recall@1000
      value: 89.86
    - type: recall_at_500
      name: Recall@500
      value: 85.96
    - type: recall_at_100
      name: Recall@100
      value: 73.94
    - type: recall_at_10
      name: Recall@10
      value: 46.33
    - type: map_at_10
      name: MAP@10
      value: 24.15
    - type: ndcg_at_10
      name: nDCG@10
      value: 29.58
    - type: mrr_at_10
      name: MRR@10
      value: 24.68
---

# spladev2-camembert-base-mmarcoFR

This is a [SPLADE-max](https://doi.org/10.48550/arXiv.2109.10086) model for **French** that can be used for semantic search. The model maps queries and passages to 32k-dimensional sparse vectors, which are used to compute relevance through cosine similarity.

## Usage

Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

# The masked-LM head is needed to obtain per-token logits over the vocabulary.
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/spladev2-camembert-base-mmarcoFR')
model = AutoModelForMaskedLM.from_pretrained('antoinelouis/spladev2-camembert-base-mmarcoFR')

q_input = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
p_input = tokenizer(passages, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    q_output = model(**q_input)
    p_output = model(**p_input)

# SPLADE-max pooling: zero out padding, apply log(1 + ReLU(.)), then take the max over tokens.
q_activations = torch.amax(torch.log1p(torch.relu(q_output.logits * q_input['attention_mask'].unsqueeze(-1))), dim=1)
p_activations = torch.amax(torch.log1p(torch.relu(p_output.logits * p_input['attention_mask'].unsqueeze(-1))), dim=1)

# L2-normalize so that the dot product below equals the cosine similarity.
q_activations = torch.nn.functional.normalize(q_activations, p=2, dim=1)
p_activations = torch.nn.functional.normalize(p_activations, p=2, dim=1)

similarity = q_activations @ p_activations.T
print(similarity)
```
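
To make the pooling step in the snippet above more concrete, here is a minimal pure-Python sketch of SPLADE-max pooling on a toy input with a 4-term vocabulary. The function names `splade_max` and `cosine` and all toy values are illustrative assumptions, not part of the model's code; the real model performs the same operations on tensors.

```python
import math

def splade_max(logits, attention_mask):
    """Pool per-token vocabulary logits into one non-negative vector.

    logits: list of per-token lists (one score per vocabulary term).
    attention_mask: list of 0/1 flags, 0 marking padding tokens.
    """
    pooled = [0.0] * len(logits[0])
    for token_logits, keep in zip(logits, attention_mask):
        if not keep:
            continue  # padding tokens contribute nothing
        for j, value in enumerate(token_logits):
            # log(1 + ReLU(x)), then max over token positions
            pooled[j] = max(pooled[j], math.log1p(max(value, 0.0)))
    return pooled

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

# Toy input (hypothetical values): 3 tokens, the last one is padding.
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding, ignored thanks to the mask
]
toy_mask = [1, 1, 0]

vector = splade_max(toy_logits, toy_mask)
print(vector)
```

Negative logits map to exactly zero, which is what makes the resulting vectors sparse.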

## Evaluation

The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of 8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
To see how it compares to other neural retrievers in French, check out the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
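
As a rough illustration of two of the reported metrics, the sketch below computes MRR@k and Recall@k from ranked passage ids. The helper names and the toy `runs` and `qrels` dictionaries are assumptions made for this example; the card does not say which evaluation tooling was actually used.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1/rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_over_queries(metric, runs, qrels, **kwargs):
    """Average a per-query metric over all queries that have judgments."""
    scores = [metric(runs[qid], qrels[qid], **kwargs) for qid in qrels]
    return sum(scores) / len(scores)

# Toy data (hypothetical): ranked passage ids per query and relevant ids.
runs = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p2", "p4"]}
qrels = {"q1": {"p1"}, "q2": {"p4", "p8"}}

mrr = mean_over_queries(reciprocal_rank_at_k, runs, qrels, k=10)
recall = mean_over_queries(recall_at_k, runs, qrels, k=3)
print(f"MRR@10 = {mrr:.4f}, R@3 = {recall:.4f}")
```

The values in the model card metadata appear to be percentages, so a score computed this way would be multiplied by 100 before comparison.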

## Training

#### Data

The model is trained on the French training samples of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries. We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset) with BM25 negatives.

#### Implementation

The model is initialized from the [almanach/camembert-base](https://huggingface.co/almanach/camembert-base) checkpoint and optimized via a combination of the InfoNCE ranking loss with a temperature of 0.05 and the FLOPS regularization loss, with a quadratic increase of lambda until step 33k, after which it remains constant at lambda_q = 3e-4 and lambda_d = 1e-4. The model is fine-tuned on one 80GB NVIDIA H100 GPU for 100k steps using the AdamW optimizer with a batch size of 128 and a peak learning rate of 2e-5, with warm-up over the first 4,000 steps and linear scheduling. The maximum sequence lengths for questions and passages were fixed to 32 and 128 tokens, respectively.
Relevance scores are computed with cosine similarity.
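
The schedules and the regularizer described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the training code: the function names are hypothetical, the learning rate is assumed to rise linearly during warm-up and then decay linearly to zero at step 100k, and the FLOPS penalty follows the usual definition (sum over vocabulary terms of the squared mean activation across the batch).

```python
def flops_lambda(step, lambda_max, ramp_steps=33_000):
    """Quadratic ramp of the regularization weight, constant after ramp_steps."""
    ratio = min(step / ramp_steps, 1.0)
    return lambda_max * ratio ** 2

def learning_rate(step, peak_lr=2e-5, warmup_steps=4_000, total_steps=100_000):
    """Linear warm-up to peak_lr, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

def flops_penalty(batch_vectors):
    """Sum over vocabulary terms of the squared mean activation in the batch."""
    batch_size = len(batch_vectors)
    total = 0.0
    for j in range(len(batch_vectors[0])):
        mean_activation = sum(vec[j] for vec in batch_vectors) / batch_size
        total += mean_activation ** 2
    return total

# Regularization weights halfway through the ramp and after it.
print(flops_lambda(16_500, 3e-4), flops_lambda(50_000, 3e-4))
# Learning rate during warm-up, at the peak, and at the end of training.
print(learning_rate(2_000), learning_rate(4_000), learning_rate(100_000))
```

Under these assumptions, the total objective at a given step would be the ranking loss plus `flops_lambda(step, 3e-4)` times the penalty on query vectors plus `flops_lambda(step, 1e-4)` times the penalty on passage vectors.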

## Citation

```bibtex
@online{louis2024decouvrir,
	author    = {Antoine Louis},
	title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
	publisher = {Hugging Face},
	month     = mar,
	year      = {2024},
	url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}
```