---
license: bigscience-bloom-rail-1.0
language:
- fr
- en
pipeline_tag: sentence-similarity
---

Bloomz-3b-retriever
---------------------

Introducing Bloomz-3b-retriever, based on the [Bloomz-3b-sft-chat model](https://huggingface.co/cmarkea/bloomz-3b-sft-chat). This model builds embedding representations of queries and documents for retrieval tasks, linking each query to the documents that answer it. It is designed to be cross-language, meaning it is language-agnostic between English and French. This makes it well suited to Open Domain Question Answering (ODQA): queries and documents are projected into a shared embedding space whose geometry brings a query close to its relevant contexts.

![embedding](https://i.postimg.cc/L6KC7tvw/embedding.png)

Training
--------

It is a bi-encoder trained on a corpus of context/query pairs, 50% in English and 50% in French. The query/context language combinations are evenly distributed (1/4 French/French, 1/4 French/English, 1/4 English/French, 1/4 English/English). The training objective pulls the embeddings of a query and its associated context together using a contrastive method; the loss function is the one defined in [Deep Metric Learning using Triplet Network](https://arxiv.org/abs/1412.6622).
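As a rough illustration, here is a minimal sketch of that triplet loss in PyTorch. The function name, the batching, and the negative-sampling strategy are assumptions for the sake of the example, not the actual training code:

```python
import torch
import torch.nn.functional as F

def triplet_softmax_loss(query_emb, pos_emb, neg_emb):
    """Sketch of the loss from "Deep Metric Learning using Triplet Network"
    (Hoffer & Ailon, 2015): a softmax over the two L2 distances, pushing the
    query/positive-context distance toward zero relative to the negative one.
    How negatives are chosen here is an assumption."""
    d_pos = F.pairwise_distance(query_emb, pos_emb)  # ||f(q) - f(c+)||_2
    d_neg = F.pairwise_distance(query_emb, neg_emb)  # ||f(q) - f(c-)||_2
    # Softmax over (d_pos, d_neg): soft_pos tends to 0 when d_pos << d_neg.
    soft_pos = torch.exp(d_pos) / (torch.exp(d_pos) + torch.exp(d_neg))
    # The paper's loss is, up to a constant factor, the squared positive term.
    return soft_pos.pow(2).mean()
```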

Benchmark
---------

Based on the SQuAD evaluation dataset (6,000 queries distributed over 1,200 contexts grouped into 35 themes), we compare performance in terms of the mean rank of the correct context for a query (Top-mean), the standard deviation of that rank (Top-std), and the percentage of queries whose correct context appears in the top-1, top-5, and top-10 results. We compare our model against a TF-IDF baseline trained on the SQuAD train split, CamemBERT, Sentence-BERT, and Bloomz-560m-retriever. We observe these performances in both monolingual and cross-language settings (query in French, context in English). A sketch of how these metrics can be computed is given after the tables.

| Model (FR/FR)                                                                                       | Top-mean | Top-std | Top-1 (%) | Top-5 (%) | Top-10 (%) |
|-----------------------------------------------------------------------------------------------------|----------|:-------:|-----------|-----------|------------|
| TF-IDF                                                                                              | 128      | 269     | 23        | 46        | 56         |
| [CamemBERT](https://huggingface.co/camembert/camembert-base)                                        | 417      | 347     | 1         | 2         | 3          |
| [Sentence-BERT](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 11       | 41      | 43        | 71        | 82         |
| [Bloomz-560m-retriever](https://huggingface.co/cmarkea/bloomz-560m-retriever)                       | 10       | 47      | 51        | 78        | 86         |
| [Bloomz-3b-retriever](https://huggingface.co/cmarkea/bloomz-3b-retriever)                           | 9        | 37      | 50        | 79        | 87         |

| Model (EN/FR)                                                                                       | Top-mean | Top-std | Top-1 (%) | Top-5 (%) | Top-10 (%) |
|-----------------------------------------------------------------------------------------------------|----------|:-------:|-----------|-----------|------------|
| TF-IDF                                                                                              | 607      | 334     | 0         | 0         | 0          |
| [CamemBERT](https://huggingface.co/camembert/camembert-base)                                        | 432      | 345     | 0         | 1         | 1          |
| [Sentence-BERT](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 12       | 47      | 44        | 73        | 83         |
| [Bloomz-560m-retriever](https://huggingface.co/cmarkea/bloomz-560m-retriever)                       | 10       | 44      | 49        | 77        | 86         |
| [Bloomz-3b-retriever](https://huggingface.co/cmarkea/bloomz-3b-retriever)                           | 9        | 38      | 50        | 78        | 87         |
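
The snippet below sketches how these ranking metrics could be computed from a query-to-context distance matrix. The helper name and the `gold_idx` array mapping each query to its correct context are hypothetical, introduced only for illustration:

```python
import numpy as np

def retrieval_metrics(dist, gold_idx):
    # dist: (n_queries, n_contexts) distance matrix, as produced in the
    # usage example below; gold_idx: (n_queries,) index of the correct
    # context for each query.
    order = dist.argsort(axis=-1)
    # 1-based rank of the correct context for each query.
    ranks = (order == gold_idx[:, None]).argmax(axis=-1) + 1
    metrics = {"Top-mean": ranks.mean(), "Top-std": ranks.std()}
    for k in (1, 5, 10):
        metrics[f"Top-{k} (%)"] = 100 * (ranks <= k).mean()
    return metrics
```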


How to Use Bloomz-3b-retriever
--------------------------------

The following example uses the pipeline API of the Transformers library.

```python
import numpy as np
from transformers import pipeline
from scipy.spatial.distance import cdist

retriever = pipeline('feature-extraction', 'cmarkea/bloomz-3b-retriever')

# Important: the embedding of a sequence is the hidden state of its
# *last* token (the model is a causal decoder).
def infer(texts):
    # Each pipeline output has shape (1, seq_len, hidden_size);
    # keep only the last token to get one vector per input text.
    return np.array([out[0][-1] for out in retriever(texts)])

list_of_contexts = [...]
emb_contexts = infer(list_of_contexts)
list_of_queries = [...]
emb_queries = infer(list_of_queries)

# Important: use the L2 (Euclidean) distance, the metric the model
# was trained with.
dist = cdist(emb_queries, emb_contexts, 'euclidean')
top_k = lambda k: [
    [list_of_contexts[idx] for idx in row]
    for row in dist.argsort(axis=-1)[:, :k]
]

# Top-5 nearest contexts for each query.
top_contexts = top_k(5)
```
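
For example, the `[...]` placeholders above could be filled with data such as the following (illustrative strings, chosen to demonstrate the cross-language behavior):

```python
# Hypothetical data for demonstration only.
list_of_contexts = [
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is the highest mountain on Earth.",
]
list_of_queries = ["En quelle année la tour Eiffel a-t-elle été achevée ?"]
```

Here the French query should be projected closest to the first (English) context.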

Citation
--------

```bibtex
@online{DeBloomzRet,
  AUTHOR = {Cyrile Delestre},
  URL = {https://huggingface.co/cmarkea/bloomz-3b-retriever},
  YEAR = {2023},
  KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
}
```