sebastian-hofstaetter committed on
Commit · 5ddf7ff
Parent(s): 160bf93
Add model, tokenizer, & initial model card

Browse files
- README.md +113 -0
- config.json +8 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tokenizer_config.json +1 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,113 @@
---
language: "en"
tags:
- dpr
- dense-passage-retrieval
- knowledge-distillation
datasets:
- ms_marco
---

# Margin-MSE Trained DistilBERT-Cat (vanilla/mono/concatenated DistilBERT re-ranker)

We provide a retrieval-trained DistilBERT-Cat model (https://arxiv.org/pdf/2004.12832.pdf). Our model is trained with Margin-MSE using a 3-teacher BERT_Cat (concatenated BERT scoring) ensemble on MSMARCO-Passage.

This instance can be used to **re-rank a candidate set**. The architecture is a 6-layer DistilBERT with an additional single linear layer at the end.

If you want to know more about our simple yet effective knowledge distillation method for efficient information retrieval models, which works for a variety of student architectures and is used for this model instance, check out our paper: https://arxiv.org/abs/2010.02666 🎉

For more information, training data, source code, and a minimal usage example please visit: https://github.com/sebastian-hofstaetter/neural-ranking-kd

## Configuration

- fp16 trained, so fp16 inference shouldn't be a problem

## Model Code

````python
from transformers import AutoTokenizer, AutoModel, PreTrainedModel, PretrainedConfig
import torch


class BERT_Cat_Config(PretrainedConfig):
    model_type = "BERT_Cat"
    bert_model: str
    trainable: bool = True


class BERT_Cat(PreTrainedModel):
    """
    The vanilla/mono BERT concatenated architecture (we lovingly refer to it as BERT_Cat)
    -> requires query-document concatenation before the model, so that batched input is possible
    """
    config_class = BERT_Cat_Config
    base_model_prefix = "bert_model"

    def __init__(self, cfg) -> None:
        super().__init__(cfg)

        self.bert_model = AutoModel.from_pretrained(cfg.bert_model)

        # freeze or unfreeze the encoder depending on the config
        for p in self.bert_model.parameters():
            p.requires_grad = cfg.trainable

        # single linear layer mapping the [CLS]-position vector to a relevance score
        self._classification_layer = torch.nn.Linear(self.bert_model.config.hidden_size, 1)

    def forward(self, query_n_doc_sequence):
        vecs = self.bert_model(**query_n_doc_sequence)[0][:, 0, :]  # assuming a DistilBERT model here
        score = self._classification_layer(vecs)
        return score


tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # honestly not sure if that is the best way to go, but it works :)
model = BERT_Cat.from_pretrained("sebastian-hofstaetter/distilbert-cat-margin_mse-T2-msmarco")
````
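
As a rough usage sketch (the query and passage texts below are made-up placeholders, and this is only one possible way to batch the pairs), re-ranking a small candidate set could look like this:

```python
# Illustrative re-ranking sketch: score a few candidate passages for one query and sort them.
query = "what is knowledge distillation"
passages = [
    "Knowledge distillation transfers knowledge from a large teacher model to a smaller student.",
    "The weather in Vienna is usually mild in spring.",
]

# Concatenate query and passage into one sequence per pair, as BERT_Cat expects.
inputs = tokenizer([query] * len(passages), passages,
                   padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(inputs).squeeze(-1)  # one relevance score per (query, passage) pair

for score, passage in sorted(zip(scores.tolist(), passages), reverse=True):
    print(f"{score:.3f}  {passage}")
```

Since the checkpoint was trained in fp16 (see Configuration above), casting the model with `model.half()` on a GPU should work for inference as well.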

## Effectiveness on MSMARCO Passage & TREC Deep Learning '19

We trained our model on the MSMARCO standard ("small", 400K-query) training triples with knowledge distillation, using a batch size of 32 on a single consumer-grade GPU (11GB memory).
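
As a point of reference, the Margin-MSE objective can be sketched as follows: the student is trained so that its score margin between the relevant and the non-relevant passage of each training triple matches the (averaged) margin of the teacher ensemble. The snippet below is an illustrative reconstruction of that idea, not the released training code; the score tensors are assumed to hold one scalar per triple.

```python
import torch

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    # Margin-MSE (sketch): MSE between the student's score margin and the
    # teacher ensemble's score margin for each (query, positive, negative) triple.
    return torch.nn.functional.mse_loss(student_pos - student_neg,
                                        teacher_pos - teacher_neg)

# Hypothetical usage with batched scores of shape [batch]:
# loss = margin_mse_loss(model(pos_inputs).squeeze(-1), model(neg_inputs).squeeze(-1),
#                        teacher_pos_scores, teacher_neg_scores)
```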

For re-ranking we used the top-1000 BM25 results.

### MSMARCO-DEV

Here we use the larger 49K-query DEV set (results are in the same range as on the smaller 7K DEV set; only minimal differences are to be expected).

| Model                                      | MRR@10 | NDCG@10 |
|--------------------------------------------|--------|---------|
| BM25                                       | .194   | .241    |
| **Margin-MSE DistilBERT_Cat** (Re-ranking) | .391   | .451    |

### TREC-DL'19

For MRR we use the recommended binarization point of 2 on the graded relevance labels. This might skew the results when compared with MRR values computed at other binarization points.

| Model                                      | MRR@10 | NDCG@10 |
|--------------------------------------------|--------|---------|
| BM25                                       | .689   | .501    |
| **Margin-MSE DistilBERT_Cat** (Re-ranking) | .891   | .747    |
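
To make the binarization point concrete, here is an illustrative sketch (with made-up judgments, not TREC data) of computing MRR@10 when grades >= 2 count as relevant:

```python
def mrr_at_10(ranked_doc_ids, graded_qrels, binarization_point=2):
    # Reciprocal rank of the first document whose graded relevance label
    # reaches the binarization point, looking only at the top 10 results.
    for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1):
        if graded_qrels.get(doc_id, 0) >= binarization_point:
            return 1.0 / rank
    return 0.0

# Hypothetical example: the grade-1 document does not count as relevant here.
qrels = {"d1": 1, "d2": 3, "d3": 0}
print(mrr_at_10(["d1", "d3", "d2"], qrels))  # 1/3, since d2 (grade 3) sits at rank 3
```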

For more metrics, baselines, info and analysis, please see the paper: https://arxiv.org/abs/2010.02666

## Limitations & Bias

- The model inherits social biases from both DistilBERT and MSMARCO.

- The model is only trained on relatively short passages of MSMARCO (average passage length of ~60 words), so it might struggle with longer text.

## Citation

If you use our model checkpoint please cite our work as:

```
@misc{hofstaetter2020_crossarchitecture_kd,
      title={Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation},
      author={Sebastian Hofst{\"a}tter and Sophia Althammer and Michael Schr{\"o}der and Mete Sertkan and Allan Hanbury},
      year={2020},
      eprint={2010.02666},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}
```
config.json
ADDED
@@ -0,0 +1,8 @@
{
  "architectures": [
    "BERT_Cat"
  ],
  "bert_model": "distilbert-base-uncased",
  "model_type": "BERT_Cat",
  "trainable": true
}
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:390e15c0bbed9cb3c909ffeb6f89b910db673e12f23e9c2366d9c9c4e267ed2d
size 265477723
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "name_or_path": "distilbert-base-uncased"}
vocab.txt
ADDED
The diff for this file is too large to render.
See raw diff