davidmezzetti committed · Commit 8f83317 · 1 Parent(s): 9aca0b5

Upload model

1_Dense/config.json ADDED
@@ -0,0 +1 @@
+ {"in_features": 768, "out_features": 128, "bias": false, "activation_function": "torch.nn.modules.linear.Identity"}
1_Dense/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d54c6b56e094486487b936026e753f82f01e3b833222f65fd2a6334fbfab822e
+ size 393304
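The pointer size can be sanity-checked against the Dense configuration. This is a rough sketch, assuming the weights are stored as float32 (the Dense files do not state a dtype) and that whatever remains is file header and metadata:

```python
# Sanity check: does the Git LFS pointer size fit a 768 -> 128 linear layer without bias?
IN_FEATURES = 768      # "in_features" from 1_Dense/config.json
OUT_FEATURES = 128     # "out_features" from 1_Dense/config.json
BYTES_PER_PARAM = 4    # assumption: float32 weights
FILE_SIZE = 393304     # "size" field of the Git LFS pointer


def dense_weight_bytes(in_features: int, out_features: int, bytes_per_param: int) -> int:
    """Bytes needed for the weight matrix alone ("bias" is false, so no bias vector)."""
    return in_features * out_features * bytes_per_param


weight_bytes = dense_weight_bytes(IN_FEATURES, OUT_FEATURES, BYTES_PER_PARAM)
header_bytes = FILE_SIZE - weight_bytes  # remainder left for header and metadata

print(weight_bytes)  # 393216
print(header_bytes)  # 88
```

Under that assumption the weight matrix accounts for all but 88 bytes of the file, which is consistent with a single projection matrix and a small header.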
README.md ADDED
@@ -0,0 +1,144 @@
+ ---
+ tags:
+ - ColBERT
+ - PyLate
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - loss:Contrastive
+ base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
+ pipeline_tag: sentence-similarity
+ library_name: PyLate
+ metrics:
+ - accuracy
+ model-index:
+ - name: PyLate model based on microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
+   results:
+   - task:
+       type: col-berttriplet
+       name: Col BERTTriplet
+     dataset:
+       name: Unknown
+       type: unknown
+     metrics:
+     - type: accuracy
+       value: 0.9996359348297119
+       name: Accuracy
+ language: en
+ license: apache-2.0
+ ---
+
+ # PubMedBERT ColBERT
+
+ This is a [PyLate](https://github.com/lightonai/pylate) model fine-tuned from [microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext). It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
+
+ ## Usage (txtai)
+
+ This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval-augmented generation (RAG).
+
+ _Note: txtai 9.0+ is required for late interaction model support._
+
+ ```python
+ import txtai
+
+ embeddings = txtai.Embeddings(
+     sparse="neuml/pubmedbert-base-colbert",
+     content=True
+ )
+ embeddings.index(documents())
+
+ # Run a query
+ embeddings.search("query to run")
+ ```
+
+ Late interaction models excel as reranker pipelines.
+
+ ```python
+ from txtai.pipeline import Reranker, Similarity
+
+ similarity = Similarity(path="neuml/pubmedbert-base-colbert", lateencode=True)
+ ranker = Reranker(embeddings, similarity)
+ ranker("query to run")
+ ```
+
+ ## Usage (PyLate)
+
+ Alternatively, the model can be loaded with [PyLate](https://github.com/lightonai/pylate).
+
+ ```python
+ from pylate import rank, models
+
+ queries = [
+     "query A",
+     "query B",
+ ]
+
+ documents = [
+     ["document A", "document B"],
+     ["document 1", "document C", "document B"],
+ ]
+
+ documents_ids = [
+     [1, 2],
+     [1, 3, 2],
+ ]
+
+ model = models.ColBERT(
+     model_name_or_path="neuml/pubmedbert-base-colbert",
+ )
+
+ queries_embeddings = model.encode(
+     queries,
+     is_query=True,
+ )
+
+ documents_embeddings = model.encode(
+     documents,
+     is_query=False,
+ )
+
+ reranked_documents = rank.rerank(
+     documents_ids=documents_ids,
+     queries_embeddings=queries_embeddings,
+     documents_embeddings=documents_embeddings,
+ )
+ ```
+
+ ## Evaluation Results
+
+ This model is compared below against the top base models on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). A popular smaller model and the most downloaded PubMed similarity model on the Hugging Face Hub were also evaluated.
+
+ The following datasets were used to evaluate model performance.
+
+ - [PubMed QA](https://huggingface.co/datasets/qiaojin/PubMedQA)
+   - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
+ - [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
+   - Split: test, Pair: (title, text)
+ - [PubMed Summary](https://huggingface.co/datasets/armanc/scientific_papers)
+   - Subset: pubmed, Split: validation, Pair: (article, abstract)
+
+ Evaluation results are shown below. The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.
+
+ | Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
+ | --- | --- | --- | --- | --- |
+ | [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 90.40 | 95.92 | 94.07 | 93.46 |
+ | [bge-base-en-v1.5](https://hf.co/BAAI/bge-base-en-v1.5) | 91.02 | 95.82 | 94.49 | 93.78 |
+ | [gte-base](https://hf.co/thenlper/gte-base) | 92.97 | 96.90 | 96.24 | 95.37 |
+ | [**pubmedbert-base-colbert**](https://hf.co/neuml/pubmedbert-base-colbert) | **93.94** | **97.21** | **95.27** | **95.47** |
+ | [**pubmedbert-base-colbert (MUVERA)**](https://hf.co/neuml/pubmedbert-base-colbert) | **88.77** | **93.51** | **95.18** | **92.49** |
+ | [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 93.27 | 97.00 | 96.58 | 95.62 |
+ | [S-PubMedBert-MS-MARCO](https://hf.co/pritamdeka/S-PubMedBert-MS-MARCO) | 90.86 | 93.68 | 93.54 | 92.69 |
+
+ While this isn't the highest-scoring model on average, it is the best model on the first two datasets, which are retrieval datasets. ColBERT models can be better at picking up on query nuances because token vectors are not mean-pooled into a single vector.
+
+ The model also performs well enough with [MUVERA encoding](https://arxiv.org/abs/2405.19504). The goal of MUVERA is "good enough" recall that picks up on the signal and is then paired with a reranker pipeline.
+
+ ### Full Model Architecture
+
+ ```
+ ColBERT(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
+ )
+ ```
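The README names the MaxSim operator without spelling it out. The sketch below illustrates the usual late interaction scoring rule in plain Python: for each query token vector, take the best dot product against any document token vector, then sum over query tokens. The helper names `dot` and `maxsim` and the toy 2-dimensional vectors are assumptions for illustration; they are not PyLate or txtai APIs, and real implementations work on batched, normalized 128-dimensional tensors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vectors: List[Vector], document_vectors: List[Vector]) -> float:
    """Sum over query tokens of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in document_vectors) for q in query_vectors)


# Toy 2-dimensional token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5]]

scores = {"doc_a": maxsim(query, doc_a), "doc_b": maxsim(query, doc_b)}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)   # {'doc_a': 2.0, 'doc_b': 1.0}
print(ranking)  # ['doc_a', 'doc_b']
```

Because every query token keeps its own vector, a document that matches each query term individually (`doc_a`) outscores one that only matches their average (`doc_b`).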
added_tokens.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "[D] ": 30523,
+   "[Q] ": 30522
+ }
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30524
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,49 @@
+ {
+   "__version__": {
+     "sentence_transformers": "4.0.2",
+     "transformers": "4.48.2",
+     "pytorch": "2.8.0+cu128"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "MaxSim",
+   "query_prefix": "[Q] ",
+   "document_prefix": "[D] ",
+   "query_length": 512,
+   "document_length": 512,
+   "attend_to_expansion_tokens": false,
+   "skiplist_words": [
+     "!",
+     "\"",
+     "#",
+     "$",
+     "%",
+     "&",
+     "'",
+     "(",
+     ")",
+     "*",
+     "+",
+     ",",
+     "-",
+     ".",
+     "/",
+     ":",
+     ";",
+     "<",
+     "=",
+     ">",
+     "?",
+     "@",
+     "[",
+     "\\",
+     "]",
+     "^",
+     "_",
+     "`",
+     "{",
+     "|",
+     "}",
+     "~"
+   ]
+ }
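To make the `query_prefix`, `document_prefix`, and `skiplist_words` fields concrete, here is a simplified sketch of what they mean. It assumes the prefix is prepended to the raw text and that skiplist entries are document tokens left out of scoring. The helpers `add_prefix` and `scoring_tokens` are hypothetical and operate on plain strings, whereas PyLate itself works on token ids and embeddings.

```python
import string
from typing import List

QUERY_PREFIX = "[Q] "     # "query_prefix" from the config
DOCUMENT_PREFIX = "[D] "  # "document_prefix" from the config
# The 32 entries of "skiplist_words" are the characters of string.punctuation.
SKIPLIST = set(string.punctuation)


def add_prefix(text: str, is_query: bool) -> str:
    """Hypothetical helper: prepend the role marker before tokenization."""
    return (QUERY_PREFIX if is_query else DOCUMENT_PREFIX) + text


def scoring_tokens(document_tokens: List[str]) -> List[str]:
    """Hypothetical helper: keep only document tokens that take part in scoring."""
    return [token for token in document_tokens if token not in SKIPLIST]


print(add_prefix("what causes anemia?", is_query=True))
print(scoring_tokens(["iron", "deficiency", ",", "anemia", "."]))
```

The first call yields `[Q] what causes anemia?`, and the second drops the comma and period, leaving only the three word tokens.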
eval/triplet_evaluation_results.csv ADDED
@@ -0,0 +1,88 @@
+ epoch,steps,accuracy
+ 0.13025921583952066,2000,0.9988450407981873
+ 0.2605184316790413,4000,0.999037504196167
+ 0.3907776475185619,6000,0.9988450407981873
+ 0.5210368633580826,8000,0.9994224905967712
+ 0.6512960791976032,10000,0.9996150135993958
+ 0.7815552950371238,12000,0.9992300271987915
+ 0.9118145108766446,14000,0.9992300271987915
+ 0.09512937595129375,2000,0.9996359348297119
+ 0.1902587519025875,4000,1.0
+ 0.2853881278538813,6000,1.0
+ 0.380517503805175,8000,1.0
+ 0.4756468797564688,10000,0.999817967414856
+ 0.5707762557077626,12000,0.999817967414856
+ 0.6659056316590564,14000,0.999817967414856
+ 0.76103500761035,16000,0.999817967414856
+ 0.8561643835616438,18000,0.999817967414856
+ 0.9512937595129376,20000,0.999817967414856
+ 0.09512937595129375,2000,0.999089777469635
+ 0.1902587519025875,4000,0.9994539022445679
+ 0.2853881278538813,6000,0.9994539022445679
+ 0.380517503805175,8000,0.9994539022445679
+ 0.4756468797564688,10000,0.9994539022445679
+ 0.5707762557077626,12000,0.9994539022445679
+ 0.6659056316590564,14000,0.9994539022445679
+ 0.76103500761035,16000,0.9996359348297119
+ 0.8561643835616438,18000,0.9996359348297119
+ 0.9512937595129376,20000,0.9996359348297119
+ 0.09512937595129375,2000,0.9983615875244141
+ 0.1902587519025875,4000,0.9987256526947021
+ 0.2853881278538813,6000,0.9987256526947021
+ 0.380517503805175,8000,0.9996359348297119
+ 0.4756468797564688,10000,0.9996359348297119
+ 0.5707762557077626,12000,0.999817967414856
+ 0.6659056316590564,14000,0.9996359348297119
+ 0.76103500761035,16000,1.0
+ 0.8561643835616438,18000,0.999817967414856
+ 0.9512937595129376,20000,0.999817967414856
+ 0.09512937595129375,2000,0.99817955493927
+ 0.1902587519025875,4000,0.998907744884491
+ 0.2853881278538813,6000,0.998907744884491
+ 0.380517503805175,8000,0.9985436201095581
+ 0.4756468797564688,10000,0.9983615875244141
+ 0.5707762557077626,12000,0.9983615875244141
+ 0.6659056316590564,14000,0.99817955493927
+ 0.76103500761035,16000,0.9985436201095581
+ 0.8561643835616438,18000,0.9983615875244141
+ 0.9512937595129376,20000,0.9983615875244141
+ 0.09512937595129375,2000,0.998907744884491
+ 0.1902587519025875,4000,0.999271810054779
+ 0.2853881278538813,6000,0.999089777469635
+ 0.380517503805175,8000,0.999089777469635
+ 0.4756468797564688,10000,0.9994539022445679
+ 0.5707762557077626,12000,0.999271810054779
+ 0.6659056316590564,14000,0.999271810054779
+ 0.76103500761035,16000,0.999271810054779
+ 0.8561643835616438,18000,0.999089777469635
+ 0.9512937595129376,20000,0.999089777469635
+ 0.09905403397553365,2000,0.998598575592041
+ 0.1981080679510673,4000,0.9989989995956421
+ 0.297162101926601,6000,0.9991992115974426
+ 0.3962161359021346,8000,0.9989989995956421
+ 0.49527016987766825,10000,0.9991992115974426
+ 0.594324203853202,12000,0.9989989995956421
+ 0.6933782378287355,14000,0.9991992115974426
+ 0.7924322718042692,16000,0.9989989995956421
+ 0.8914863057798029,18000,0.9989989995956421
+ 0.9905403397553365,20000,0.9989989995956421
+ 0.09512937595129375,2000,0.9983615875244141
+ 0.1902587519025875,4000,0.9985436201095581
+ 0.2853881278538813,6000,0.9987256526947021
+ 0.380517503805175,8000,0.9987256526947021
+ 0.4756468797564688,10000,0.999089777469635
+ 0.5707762557077626,12000,0.999271810054779
+ 0.6659056316590564,14000,0.9994539022445679
+ 0.76103500761035,16000,0.998907744884491
+ 0.8561643835616438,18000,0.9994539022445679
+ 0.9512937595129376,20000,0.999271810054779
+ 0.09512937595129375,2000,0.999089777469635
+ 0.1902587519025875,4000,0.999089777469635
+ 0.2853881278538813,6000,0.999271810054779
+ 0.380517503805175,8000,0.999271810054779
+ 0.4756468797564688,10000,0.9994539022445679
+ 0.5707762557077626,12000,0.999817967414856
+ 0.6659056316590564,14000,0.9996359348297119
+ 0.76103500761035,16000,0.999817967414856
+ 0.8561643835616438,18000,0.999817967414856
+ 0.9512937595129376,20000,0.9996359348297119
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a2a46d33500d2dda112f63014cec8d3314497b6fe6e378e23bc21b754d6a3c57
+ size 437957472
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Dense",
+     "type": "pylate.models.Dense.Dense"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[MASK]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,74 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "30522": {
+       "content": "[Q] ",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "30523": {
+       "content": "[D] ",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "never_split": null,
+   "pad_token": "[MASK]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
triplet_evaluation_results.csv ADDED
@@ -0,0 +1,10 @@
+ epoch,steps,accuracy
+ -1,-1,0.9998083710670471
+ -1,-1,0.9999090433120728
+ -1,-1,0.9998180866241455
+ -1,-1,0.999454140663147
+ -1,-1,0.9991812109947205
+ -1,-1,0.9993631839752197
+ -1,-1,0.999399721622467
+ -1,-1,0.9992722272872925
+ -1,-1,0.9998180866241455
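A short sketch for summarizing this file with the standard library. The CSV does not say what each row represents; treating every row as one final evaluation pass is an assumption made here for illustration.

```python
import csv
import io
from statistics import mean

# Contents of triplet_evaluation_results.csv as committed (epoch and steps are -1).
RESULTS_CSV = """epoch,steps,accuracy
-1,-1,0.9998083710670471
-1,-1,0.9999090433120728
-1,-1,0.9998180866241455
-1,-1,0.999454140663147
-1,-1,0.9991812109947205
-1,-1,0.9993631839752197
-1,-1,0.999399721622467
-1,-1,0.9992722272872925
-1,-1,0.9998180866241455
"""

rows = list(csv.DictReader(io.StringIO(RESULTS_CSV)))
accuracies = [float(row["accuracy"]) for row in rows]

summary = {
    "rows": len(accuracies),
    "min": min(accuracies),
    "max": max(accuracies),
    "mean": mean(accuracies),
}
print(summary)
```

All nine values sit above 0.999, so the spread between the lowest and highest row is under 0.001.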
vocab.txt ADDED
The diff for this file is too large to render. See raw diff