---
pipeline_tag: text-ranking
tags:
- transformers
- information-retrieval
language: pl
license: apache-2.0
base_model:
- sdadas/polish-reranker-base-ranknet
library_name: sentence-transformers
---
<h1 align="center">polish-reranker-base-ranknet</h1>
### This repository extends the original repository by providing an ONNX version for optimized inference.
This is a Polish text ranking model trained with [RankNet loss](https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf) on a large dataset of text pairs consisting of 1.4 million queries and 10 million documents.
The training data included the following parts: 1) The Polish MS MARCO training split (800k queries); 2) The ELI5 dataset translated to Polish (over 500k queries); 3) A collection of Polish medical questions and answers (approximately 100k queries).
As a teacher model, we employed [unicamp-dl/mt5-13b-mmarco-100k](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k), a large multilingual reranker based on the MT5-XXL architecture. As a student model, we chose [Polish RoBERTa](https://huggingface.co/sdadas/polish-roberta-base-v2).
Unlike more commonly used pointwise losses, which regard each query-document pair independently, the RankNet method computes loss based on queries and pairs of documents. More specifically, the loss is computed based on the relative order of documents sorted by their relevance to the query.
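To make the pairwise idea concrete, here is a minimal sketch of a RankNet-style loss for a single query. It uses the common formulation in which every pair where the teacher prefers document `i` over document `j` contributes `log(1 + exp(-(s_i - s_j)))`. The function name `ranknet_loss`, the equal weighting of pairs, and the averaging are assumptions made for illustration; the exact variant used to train this model is not specified here.

```python
import math


def ranknet_loss(student_scores, teacher_scores):
    """Mean pairwise RankNet-style loss for the documents of one query.

    Illustrative only: pair selection and weighting are assumptions.
    """
    total, count = 0.0, 0
    n = len(student_scores)
    for i in range(n):
        for j in range(n):
            # Only pairs where the teacher ranks document i above document j
            if teacher_scores[i] > teacher_scores[j]:
                diff = student_scores[i] - student_scores[j]
                # Numerically stable log(1 + exp(-diff))
                total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
                count += 1
    return total / count if count else 0.0


# Student agrees with the teacher's order: low loss
print(ranknet_loss([2.0, 1.0, 0.0], [3.0, 2.0, 1.0]))
# Student reverses the teacher's order: higher loss
print(ranknet_loss([0.0, 1.0, 2.0], [3.0, 2.0, 1.0]))
```

The loss depends only on score differences within a query, which is why the absolute scale of the model's outputs carries no meaning on its own.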
To train the reranker, we used the teacher model to assess the relevance of the documents extracted in the retrieval stage for each query. We then sorted these documents by the relevance score, obtaining a dataset consisting of queries and ordered lists of 20 documents per query.
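The list-building step described above could look roughly like the following sketch. The helper `build_ranked_list`, its `list_size` parameter, and the example scores are hypothetical. In particular, truncating to the top-scored documents is an assumption, since the text only states that each query ends up with an ordered list of 20 documents.

```python
def build_ranked_list(query, documents, teacher_scores, list_size=20):
    """Order retrieved documents by teacher relevance score (highest first).

    Hypothetical helper: keeps at most `list_size` documents per query.
    """
    ranked = sorted(
        zip(documents, teacher_scores),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {"query": query, "documents": [doc for doc, _ in ranked[:list_size]]}


# Made-up teacher scores, for illustration only
example = build_ranked_list(
    "Jak dożyć 100 lat?",
    ["doc A", "doc B", "doc C"],
    [0.1, 0.9, 0.5],
)
print(example["documents"])
```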
## Usage (Sentence-Transformers)
You can use the model like this with [sentence-transformers](https://www.SBERT.net):
```python
from sentence_transformers import CrossEncoder
import torch

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model = CrossEncoder(
    "axotion/polish-reranker-base-ranknet",
    # Identity keeps the raw logits instead of applying an activation
    default_activation_function=torch.nn.Identity(),
    max_length=512,
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Score each (query, answer) pair; a higher score means a more relevant answer
pairs = [[query, answer] for answer in answers]
results = model.predict(pairs)
print(results.tolist())
```
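Because the scores are raw logits, they are mainly useful for ordering candidates. The sketch below sorts answers by score, highest first. The helper name `rank_answers` is hypothetical, and the scores in the example are placeholders rather than real model output.

```python
def rank_answers(answers, scores):
    """Return (answer, score) tuples sorted from most to least relevant."""
    order = sorted(range(len(answers)), key=lambda i: scores[i], reverse=True)
    return [(answers[i], float(scores[i])) for i in order]


# Placeholder scores for illustration; in practice pass model.predict(pairs)
placeholder_scores = [-1.0, 2.0, -3.0]
for answer, score in rank_answers(["answer A", "answer B", "answer C"], placeholder_scores):
    print(f"{score:+.2f}  {answer}")
```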
## Usage (Hugging Face Transformers)
The model can also be used with Hugging Face Transformers in the following way:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "axotion/polish-reranker-base-ranknet"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Join query and answer with the </s></s> separator (RoBERTa-style sentence pair)
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt")

# One logit per pair; squeeze the trailing dimension to get a flat list of scores
output = model(**tokens)
results = output.logits.detach().numpy()
results = np.squeeze(results)
print(results.tolist())
```
## Usage (@huggingface/transformers)
```typescript
import { AutoTokenizer, AutoModelForSequenceClassification } from '@huggingface/transformers';

async function runRerankingOriginal() {
  try {
    const query = "Jak dożyć 100 lat?";
    const answers = [
      "Trzeba zdrowo się odżywiać i uprawiać sport.",
      "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
      "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
    ];

    const tokenizer = await AutoTokenizer.from_pretrained('axotion/polish-reranker-base-ranknet');
    const model = await AutoModelForSequenceClassification.from_pretrained('axotion/polish-reranker-base-ranknet');

    const texts = answers.map(answer => `${query}</s></s>${answer}`);
    const tokens = tokenizer(texts, { padding: 'longest', max_length: 512, truncation: true, return_tensors: 'js' });

    const output = await model(tokens);
    const logits = output.logits;
    console.log('Raw logits data:', Array.from(logits.data));
  } catch (error) {
    console.error('Error during reranking:', error);
  }
}

runRerankingOriginal();
```
## Evaluation Results
The model achieves **NDCG@10** of **60.32** in the Rerankers category of the Polish Information Retrieval Benchmark. See [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
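For readers unfamiliar with the metric, the sketch below computes NDCG@k for one query from graded relevance labels listed in ranked order. It uses the linear-gain form of DCG and assumes the ranked list contains every judged document for the query. The function names are hypothetical, and the benchmark's own evaluation code may differ in such details.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of documents in the order the reranker returned them
print(ndcg_at_k([1, 0, 0]))  # relevant document ranked first
print(ndcg_at_k([0, 1, 0]))  # relevant document ranked second
```

Leaderboards usually report the average over all queries multiplied by 100, which is the scale of the figure quoted above.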
## Citation
```bibtex
@article{dadas2024assessing,
  title={Assessing generalization capability of text ranking models in Polish},
  author={Sławomir Dadas and Małgorzata Grębowiec},
  year={2024},
  eprint={2402.14318},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```