Bloomz-3b Reranking

This reranking model is built from the cmarkea/bloomz-3b-dpo-chat model and measures the semantic correspondence between a question (query) and a context. Because its score is normalized, it can be used to filter the query/context matches returned by a retriever in an ODQA (Open-Domain Question Answering) setting. It can also reorder the retrieved results using a more effective modeling approach than the retriever's. However, this type of modeling is not suited to searching a database directly because of its high computational cost.

Developed to be language-agnostic, this model supports both French and English. It can therefore score query/context pairs in a cross-language setting, and its behavior there is not influenced by its behavior in a monolingual setting (English or French).

Dataset

The training dataset is composed of the mMARCO dataset, consisting of query/positive/hard negative triplets. Additionally, we have included SQuAD data from the "train" split, forming query/positive/hard negative triplets. In order to generate hard negative data for SQuAD, we considered contexts from the same theme as the query but from a different set of queries. Hence, the negative observations belong to the same themes as the queries but presumably do not contain the answer to the question.
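The hard-negative selection described above for SQuAD can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_squad_triplets`, the choice of one negative per query, and uniform sampling are all assumptions. The record keys `title`, `context`, and `question` follow the field names of the SQuAD dataset.

```python
import random
from collections import defaultdict


def build_squad_triplets(records, seed=0):
    """Build (query, positive, hard_negative) triplets from SQuAD-like records.

    Each record is a dict with "title" (the theme), "context" and "question".
    The hard negative is a different context drawn from the same theme.
    """
    rng = random.Random(seed)

    # Group the distinct contexts of each theme, keeping first-seen order.
    contexts_by_theme = defaultdict(list)
    for rec in records:
        if rec["context"] not in contexts_by_theme[rec["title"]]:
            contexts_by_theme[rec["title"]].append(rec["context"])

    triplets = []
    for rec in records:
        candidates = [
            c for c in contexts_by_theme[rec["title"]] if c != rec["context"]
        ]
        if not candidates:
            continue  # theme with a single context: no hard negative available
        # Assumption: one negative per query, sampled uniformly.
        triplets.append((rec["question"], rec["context"], rng.choice(candidates)))
    return triplets


# Toy records standing in for the SQuAD "train" split.
records = [
    {"title": "Oxygen", "context": "ctx A", "question": "q1"},
    {"title": "Oxygen", "context": "ctx B", "question": "q2"},
    {"title": "Rhine", "context": "ctx C", "question": "q3"},
]
triplets = build_squad_triplets(records)
```

In this toy run the "Rhine" question is dropped because its theme has no second context to serve as a negative.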

Finally, the triplets are flattened into query/context pairs, labeled 1 for query/positive pairs and 0 for query/negative pairs. For each element of a pair (query and context), the language, French or English, is chosen uniformly at random.
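The flattening step can be illustrated with a short sketch. It assumes, hypothetically, that every text is available as a dict holding its French and English versions (`{"fr": ..., "en": ...}`); the function name `flatten_triplets` and this data layout are not taken from the original training code.

```python
import random

LANGS = ("fr", "en")


def flatten_triplets(triplets, seed=0):
    """Turn (query, positive, negative) triplets into labeled pairs.

    Each text is assumed to be a dict {"fr": ..., "en": ...}.
    Returns a list of (query_text, context_text, label) tuples.
    """
    rng = random.Random(seed)
    pairs = []
    for query, positive, negative in triplets:
        for context, label in ((positive, 1), (negative, 0)):
            # The language of each element is drawn independently and uniformly.
            q_lang = rng.choice(LANGS)
            c_lang = rng.choice(LANGS)
            pairs.append((query[q_lang], context[c_lang], label))
    return pairs


# Toy triplet with both language versions of each text.
triplets = [
    (
        {"fr": "question fr", "en": "question en"},
        {"fr": "positif fr", "en": "positive en"},
        {"fr": "negatif fr", "en": "negative en"},
    )
]
pairs = flatten_triplets(triplets)
```

Each triplet yields two pairs, so the flattened dataset is balanced between labels 1 and 0, and roughly half of the pairs mix French and English.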

Evaluation

To assess the performance of the reranker, we use the "validation" split of the SQuAD dataset. We select the first question from each paragraph, together with the paragraph itself, which is the context that an oracle model would rank Top-1. Conveniently, the number of themes is limited, and every other context from the same theme as the query is considered a hard negative (contexts from other themes are simple negatives). The following table lists, for each theme, the number of contexts, each of which comes with one associated query:

Theme name	Number of contexts	Theme name	Number of contexts
Normans	39	Civil_disobedience	26
Computational_complexity_theory	48	Construction	22
Southern_California	39	Private_school	26
Sky_(United_Kingdom)	22	Harvard_University	30
Victoria_(Australia)	25	Jacksonville,_Florida	21
Huguenot	44	Economic_inequality	44
Steam_engine	46	University_of_Chicago	37
Oxygen	43	Yuan_dynasty	47
1973_oil_crisis	24	Immune_system	49
European_Union_law	40	Intergovernmental_Panel_on_Climate_Change	24
Amazon_rainforest	21	Prime_number	31
Ctenophora	31	Rhine	44
Fresno,_California	28	Scottish_Parliament	39
Packet_switching	23	Islamism	39
Black_Death	23	Imperialism	39
Geology	25	Warsaw	49
Pharmacy	26	French_and_Indian_War	46
Force	44

The evaluation corpus consists of 1204 pairs of query/context to be ranked.
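The selection rule described above (the first question of each paragraph, paired with that paragraph) can be sketched as follows. The function name `first_question_per_context` and the output keys are illustrative assumptions; the input keys `title`, `context`, and `question` follow the SQuAD field names, and toy records stand in for the real "validation" split.

```python
from collections import Counter


def first_question_per_context(records):
    """Keep one (query, context) pair per paragraph: its first question."""
    seen = set()
    pairs = []
    for rec in records:
        key = (rec["title"], rec["context"])
        if key in seen:
            continue  # a question was already kept for this paragraph
        seen.add(key)
        pairs.append(
            {
                "theme": rec["title"],
                "query": rec["question"],
                "context": rec["context"],
            }
        )
    return pairs


# Toy records standing in for the SQuAD "validation" split.
records = [
    {"title": "Normans", "context": "p1", "question": "q1"},
    {"title": "Normans", "context": "p1", "question": "q2"},
    {"title": "Normans", "context": "p2", "question": "q3"},
    {"title": "Oxygen", "context": "p3", "question": "q4"},
]
eval_pairs = first_question_per_context(records)

# Number of contexts per theme, as reported in the table above.
contexts_per_theme = Counter(pair["theme"] for pair in eval_pairs)
```

For a given query, the other contexts sharing its `theme` play the role of hard negatives, and all remaining contexts are simple negatives.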

First, the evaluation scores were computed with the query and the context in the same language (French/French).

Model (French/French)	Top-mean	Top-std	Top-1 (%)	Top-10 (%)	Top-100 (%)	MRR (x100)	mean score Top	std score Top
BM25	14.47	92.19	69.77	92.03	98.09	77.74	NA	NA
CamemBERT	5.72	36.88	69.35	95.51	98.92	79.51	0.83	0.37
DistilCamemBERT	5.54	25.90	66.11	92.77	99.17	76.00	0.80	0.39
mMiniLMv2-L12	4.43	30.27	71.51	95.68	99.42	80.17	0.78	0.38
RoBERTa (multilingual)	15.13	60.39	57.23	83.87	96.18	66.21	0.53	0.11
cmarkea/bloomz-560m-reranking	1.49	2.58	83.55	99.17	100	89.98	0.93	0.15
cmarkea/bloomz-3b-reranking	1.22	1.06	89.37	99.75	100	93.79	0.94	0.10

Then, we evaluated the model in a cross-language context, with queries in French and contexts in English.

Model (French/English)	Top-mean	Top-std	Top-1 (%)	Top-10 (%)	Top-100 (%)	MRR (x100)	mean score Top	std score Top
BM25	288.04	371.46	21.93	41.93	55.15	28.41	NA	NA
CamemBERT	12.20	61.39	59.55	89.71	97.42	70.38	0.65	0.47
DistilCamemBERT	40.97	104.78	25.66	64.78	88.62	38.83	0.53	0.49
mMiniLMv2-L12	6.91	32.16	59.88	89.95	99.09	70.39	0.61	0.46
RoBERTa (multilingual)	79.32	153.62	27.91	49.50	78.16	35.41	0.40	0.12
cmarkea/bloomz-560m-reranking	1.51	1.92	81.89	99.09	100	88.64	0.92	0.15
cmarkea/bloomz-3b-reranking	1.22	0.98	89.20	99.84	100	93.63	0.94	0.10

As observed, the cross-language context does not significantly impact the behavior of our models. If the model is used to rerank and filter the Top-K results of a search, a threshold of 0.8 can be applied to filter the contexts returned by the retriever, thereby reducing the noise in the contexts passed to RAG-type applications.
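The ranking columns of the tables above can be computed from the rank that a model assigns to the expected context of each query. The sketch below rests on assumptions about the column definitions, which the text does not spell out: Top-mean and Top-std are taken to be the mean and (population) standard deviation of that rank, Top-k the percentage of queries whose expected context is ranked within the first k, and MRR the mean reciprocal rank. The function names are hypothetical.

```python
import statistics


def rank_of_expected(scores, expected_index):
    """1-based rank of the expected context, given one score per candidate."""
    target = scores[expected_index]
    return 1 + sum(s > target for s in scores)


def ranking_metrics(ranks, ks=(1, 10, 100)):
    """Aggregate 1-based ranks of the expected context over all queries."""
    n = len(ranks)
    metrics = {
        "top_mean": statistics.mean(ranks),
        # Assumption: population standard deviation.
        "top_std": statistics.pstdev(ranks),
        "mrr_x100": 100 * sum(1 / r for r in ranks) / n,
    }
    for k in ks:
        metrics[f"top_{k}_pct"] = 100 * sum(r <= k for r in ranks) / n
    return metrics


# Scores of three candidate contexts for one query; the expected one is index 2.
example_rank = rank_of_expected([0.2, 0.9, 0.5], expected_index=2)

# Toy ranks for four queries.
metrics = ranking_metrics([1, 1, 2, 4])
```

With these toy ranks, half of the queries have their expected context in first position and the MRR (x100) is 68.75.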

How to Use Bloomz-3b-reranking

The following example uses the pipeline API of the Transformers library.

from transformers import pipeline

reranker = pipeline(
    task='text-classification',
    model='cmarkea/bloomz-3b-reranking',
    top_k=None
)

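# Illustrative inputs: replace them with your own query and the contexts
# returned by your retriever.
query: str = "Quelle est la capitale de la France ?"
contexts: list[str] = [
    "Paris is the capital and most populous city of France.",
    "The Rhine is one of the major European rivers.",
]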

similarities = reranker(
    [
        dict(
            text=context, # the model was trained with context in `text`
            text_pair=query # and query in `text_pair` argument.
        )
        for context in contexts
    ]
)

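# For each context, keep the score assigned to LABEL_1 (query/positive pair).
score_label_1 = [
    next(item['score'] for item in entry if item['label'] == 'LABEL_1')
    for entry in similarities
]
# Sort the contexts from most to least relevant.
contexts_reranked = sorted(
    zip(score_label_1, contexts),
    key=lambda x: x[0],
    reverse=True
)

# Keep only the contexts scoring at least 0.8. The guard avoids the unpacking
# error raised by zip(*...) when no context passes the threshold.
contexts_filtered = [x for x in contexts_reranked if x[0] >= 0.8]
score, contexts_cleaned = (
    zip(*contexts_filtered) if contexts_filtered else ((), ())
)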

Citation

@online{DeBloomzReranking,
  AUTHOR = {Cyrile Delestre},
  ORGANIZATION = {Cr{\'e}dit Mutuel Ark{\'e}a},
  URL = {https://huggingface.co/cmarkea/bloomz-3b-reranking},
  YEAR = {2024},
  KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
}