---
pipeline_tag: text-ranking
tags:
- transformers
- information-retrieval
language: pl
license: apache-2.0
base_model:
- sdadas/polish-reranker-base-ranknet
library_name: sentence-transformers
---

<h1 align="center">polish-reranker-base-ranknet</h1>

### This repository extends the original [sdadas/polish-reranker-base-ranknet](https://huggingface.co/sdadas/polish-reranker-base-ranknet) model by providing an ONNX version for optimized inference (see the ONNX usage example below).

This is a Polish text ranking model trained with [RankNet loss](https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf) on a large dataset of text pairs consisting of 1.4 million queries and 10 million documents. 
The training data included the following parts: 1) The Polish MS MARCO training split (800k queries); 2) The ELI5 dataset translated to Polish (over 500k queries); 3) A collection of Polish medical questions and answers (approximately 100k queries).
As a teacher model, we employed [unicamp-dl/mt5-13b-mmarco-100k](https://huggingface.co/unicamp-dl/mt5-13b-mmarco-100k), a large multilingual reranker based on the MT5-XXL architecture. As a student model, we chose [Polish RoBERTa](https://huggingface.co/sdadas/polish-roberta-base-v2).
Unlike the more commonly used pointwise losses, which treat each query-document pair independently, the RankNet method computes the loss over queries and pairs of documents. More specifically, the loss is computed based on the relative order of documents sorted by their relevance to the query.
To train the reranker, we used the teacher model to assess the relevance of the documents extracted in the retrieval stage for each query. We then sorted these documents by the relevance score, obtaining a dataset consisting of queries and ordered lists of 20 documents per query.
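
For intuition, the snippet below is a minimal sketch of the pairwise RankNet objective described above. It is an illustration only, not the actual training code: the real setup distills the teacher's ordering over lists of 20 documents per query, whereas here we simply score a few documents already sorted best-to-worst.

```python
import torch
import torch.nn.functional as F

def ranknet_loss(student_scores: torch.Tensor) -> torch.Tensor:
    """student_scores: student scores for one query's documents, ordered from
    most to least relevant according to the teacher. For every pair (i, j)
    with document i ranked above document j, the loss is -log sigmoid(s_i - s_j)."""
    diff = student_scores.unsqueeze(1) - student_scores.unsqueeze(0)  # diff[i, j] = s_i - s_j
    # keep only pairs where document i is ranked above document j
    mask = torch.triu(torch.ones_like(diff, dtype=torch.bool), diagonal=1)
    return F.softplus(-diff[mask]).mean()  # softplus(-x) == -log sigmoid(x)

# Example: student scores for four documents ordered best-to-worst by the teacher.
print(ranknet_loss(torch.tensor([2.1, 1.3, 0.2, -0.5])))
```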


## Usage (Sentence-Transformers)

You can use the model like this with [sentence-transformers](https://www.SBERT.net):

```python
from sentence_transformers import CrossEncoder
import torch

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

# Identity activation returns raw logits (the CrossEncoder default for a
# single-label model is a sigmoid), so predict() yields unbounded relevance scores.
model = CrossEncoder(
    "axotion/polish-reranker-base-ranknet",
    default_activation_function=torch.nn.Identity(),
    max_length=512,
    device="cuda" if torch.cuda.is_available() else "cpu"
)
pairs = [[query, answer] for answer in answers]
results = model.predict(pairs)
print(results.tolist())
```
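
The outputs are raw relevance scores, so higher means more relevant. For example, the candidate answers can be ranked by simply sorting on the returned scores:

```python
# Rank the answers from most to least relevant according to the model.
ranked = sorted(zip(results.tolist(), answers), key=lambda pair: pair[0], reverse=True)
for score, answer in ranked:
    print(f"{score:.3f}\t{answer}")
```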

## Usage (Huggingface Transformers)

The model can also be used with Huggingface Transformers in the following way:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

model_name = "axotion/polish-reranker-base-ranknet"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt")
output = model(**tokens)
results = output.logits.detach().numpy()
results = np.squeeze(results)
print(results.tolist())
```
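
## Usage (ONNX via Optimum)

Since this repository ships an ONNX export, it can also be loaded with [Optimum](https://huggingface.co/docs/optimum/index) and ONNX Runtime. The snippet below is a minimal sketch rather than canonical usage: it assumes the ONNX weights are stored under Optimum's default file name; if the export uses a different file name, pass `file_name=...` to `from_pretrained`.

```python
import numpy as np
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_name = "axotion/polish-reranker-base-ranknet"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Loads the ONNX graph with ONNX Runtime; assumes the default ONNX file name.
model = ORTModelForSequenceClassification.from_pretrained(model_name)

query = "Jak dożyć 100 lat?"
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]

texts = [f"{query}</s></s>{answer}" for answer in answers]
tokens = tokenizer(texts, padding="longest", max_length=512, truncation=True, return_tensors="pt")
# One raw relevance logit per query-answer pair; higher means more relevant.
scores = np.squeeze(model(**tokens).logits.detach().numpy())
print(scores.tolist())
```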

## Usage (@huggingface/transformers)
```typescript
import { AutoTokenizer, AutoModelForSequenceClassification } from '@huggingface/transformers';

async function runRerankingOriginal() {
  try {
    const query = "Jak dożyć 100 lat?";
    const answers = [
      "Trzeba zdrowo się odżywiać i uprawiać sport.",
      "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
      "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
    ];

    const tokenizer = await AutoTokenizer.from_pretrained('axotion/polish-reranker-base-ranknet');
    const model = await AutoModelForSequenceClassification.from_pretrained('axotion/polish-reranker-base-ranknet');

    // RoBERTa-style pair encoding: query and passage joined by the </s></s> separator
    const texts = answers.map(answer => `${query}</s></s>${answer}`);
    const tokens = tokenizer(texts, { padding: 'longest', max_length: 512, truncation: true, return_tensors: 'js' });
    const output = await model(tokens);
    // One raw relevance logit per answer; higher means more relevant
    const logits = output.logits;
    console.log('Raw logits data:', Array.from(logits.data));
  } catch (error) {
    console.error('Error during reranking:', error);
  }
}

runRerankingOriginal();
```

## Evaluation Results

The model achieves **NDCG@10** of **60.32** in the Rerankers category of the Polish Information Retrieval Benchmark. See [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.

## Citation

```bibtex
@article{dadas2024assessing,
  title={Assessing generalization capability of text ranking models in Polish}, 
  author={Sławomir Dadas and Małgorzata Grębowiec},
  year={2024},
  eprint={2402.14318},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```