Upload model

Files changed (15)

1_Dense/config.json +1 -0
1_Dense/model.safetensors +3 -0
README.md +144 -0
added_tokens.json +4 -0
config.json +25 -0
config_sentence_transformers.json +49 -0
eval/triplet_evaluation_results.csv +88 -0
model.safetensors +3 -0
modules.json +14 -0
sentence_bert_config.json +4 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +74 -0
triplet_evaluation_results.csv +10 -0
vocab.txt +0 -0

1_Dense/config.json ADDED

	@@ -0,0 +1 @@


+{"in_features": 768, "out_features": 128, "bias": false, "activation_function": "torch.nn.modules.linear.Identity"}

1_Dense/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d54c6b56e094486487b936026e753f82f01e3b833222f65fd2a6334fbfab822e
+size 393304
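
The 393,304-byte size in this pointer is consistent with the bias-free 768 to 128 layer described in `1_Dense/config.json`. A rough check, assuming the weights are stored as float32 (4 bytes each) and that whatever remains is the safetensors header:

```python
# Sanity check of the 1_Dense/model.safetensors size reported by the LFS pointer.
# Assumption: weights are float32 (4 bytes each); the remainder is the file header.
IN_FEATURES = 768
OUT_FEATURES = 128
BYTES_PER_FLOAT32 = 4
LFS_SIZE = 393304  # "size" field of the pointer file above

# The config sets "bias": false, so only the weight matrix is counted.
weight_bytes = IN_FEATURES * OUT_FEATURES * BYTES_PER_FLOAT32
header_bytes = LFS_SIZE - weight_bytes

print(f"weights: {weight_bytes} bytes, remainder: {header_bytes} bytes")
```

A small remainder is what one would expect for a file holding a single tensor.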

README.md ADDED

	@@ -0,0 +1,144 @@

+---
+tags:
+- ColBERT
+- PyLate
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- generated_from_trainer
+- loss:Contrastive
+base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
+pipeline_tag: sentence-similarity
+library_name: PyLate
+metrics:
+- accuracy
+model-index:
+- name: PyLate model based on microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
+  results:
+  - task:
+      type: col-berttriplet
+      name: Col BERTTriplet
+    dataset:
+      name: Unknown
+      type: unknown
+    metrics:
+    - type: accuracy
+      value: 0.9996359348297119
+      name: Accuracy
+language: en
+license: apache-2.0
+---
+# PubMedBERT ColBERT
+This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext). It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.
+## Usage (txtai)
+This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).
+_Note: txtai 9.0+ is required for late interaction model support_
+```python
+import txtai
+embeddings = txtai.Embeddings(
+  path="neuml/pubmedbert-base-colbert",
+  content=True
+)
+# documents() is a placeholder for an iterable of the texts to index
+embeddings.index(documents())
+# Run a query
+embeddings.search("query to run")
+```
+Late interaction models excel as reranker pipelines.
+```python
+from txtai.pipeline import Reranker, Similarity
+similarity = Similarity(path="neuml/pubmedbert-base-colbert", lateencode=True)
+ranker = Reranker(embeddings, similarity)
+ranker("query to run")
+```
+## Usage (PyLate)
+Alternatively, the model can be loaded with [PyLate](https://github.com/lightonai/pylate).
+```python
+from pylate import rank, models
+queries = [
+    "query A",
+    "query B",
+]
+documents = [
+    ["document A", "document B"],
+    ["document 1", "document C", "document B"],
+]
+documents_ids = [
+    [1, 2],
+    [1, 3, 2],
+]
+# Load the model from the Hugging Face Hub
+model = models.ColBERT(
+    model_name_or_path="neuml/pubmedbert-base-colbert",
+)
+queries_embeddings = model.encode(
+    queries,
+    is_query=True,
+)
+documents_embeddings = model.encode(
+    documents,
+    is_query=False,
+)
+reranked_documents = rank.rerank(
+    documents_ids=documents_ids,
+    queries_embeddings=queries_embeddings,
+    documents_embeddings=documents_embeddings,
+)
+```
+## Evaluation Results
+This model is compared below against the top base models on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard), a popular smaller model, and the most downloaded PubMed similarity model on the Hugging Face Hub.
+The following datasets were used to evaluate model performance.
+- [PubMed QA](https://huggingface.co/datasets/qiaojin/PubMedQA)
+  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
+- [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
+  - Split: test, Pair: (title, text)
+- [PubMed Summary](https://huggingface.co/datasets/armanc/scientific_papers)
+  - Subset: pubmed, Split: validation, Pair: (article, abstract)
+Evaluation results are shown below. The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is used as the evaluation metric.
+| Model                                                                         | PubMed QA | PubMed Subset | PubMed Summary | Average   |
+| ----------------------------------------------------------------------------- | --------- | ------------- | -------------- | --------- |
+| [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2)           | 90.40     | 95.92         | 94.07          | 93.46     |
+| [bge-base-en-v1.5](https://hf.co/BAAI/bge-base-en-v1.5)                            | 91.02     | 95.82         | 94.49          | 93.78     |
+| [gte-base](https://hf.co/thenlper/gte-base)                                        | 92.97     | 96.90         | 96.24          | 95.37     |
+| [**pubmedbert-base-colbert**](https://hf.co/neuml/pubmedbert-base-colbert)       | **93.94**     | **97.21**         | **95.27**          | **95.47**     |
+| [**pubmedbert-base-colbert (MUVERA)**](https://hf.co/neuml/pubmedbert-base-colbert)       | **88.77**     | **93.51**         | **95.18**          | **92.49**     |
+| [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings)       | 93.27     | 97.00         | 96.58          | 95.62     |
+| [S-PubMedBert-MS-MARCO](https://hf.co/pritamdeka/S-PubMedBert-MS-MARCO)            | 90.86     | 93.68         | 93.54          | 92.69     |
+While this model does not have the highest average score, it scores best on the first two datasets (PubMed QA and PubMed Subset), which are retrieval datasets. ColBERT models can be better at picking up on query nuances because token vectors are not mean pooled into a single vector.
+The model also performs well enough for [MUVERA encoding](https://arxiv.org/abs/2405.19504). The goal with MUVERA is "good enough" recall that picks up on the signal and is then paired with a reranker pipeline.
+### Full Model Architecture
+```
+ColBERT(
+  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+  (1): Dense({'in_features': 768, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
+)
+```
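
The README above describes a two-stage architecture (a Transformer followed by a bias-free 768 to 128 `Dense` layer) and scoring with the MaxSim operator. The following is a minimal NumPy sketch of that flow, not PyLate's implementation. The hidden states and projection weights are random stand-ins, and the L2 normalization step is an assumption based on common ColBERT practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: random projection weights, not the trained values.
HIDDEN_SIZE, EMBEDDING_SIZE = 768, 128
projection = rng.normal(size=(HIDDEN_SIZE, EMBEDDING_SIZE))


def embed(hidden_states: np.ndarray) -> np.ndarray:
    """Project per-token hidden states to 128 dims (no bias) and L2-normalize each row."""
    vectors = hidden_states @ projection
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


def maxsim(query: np.ndarray, document: np.ndarray) -> float:
    """Sum, over query tokens, of the best similarity against any document token."""
    similarities = query @ document.T  # shape: (query_tokens, document_tokens)
    return float(similarities.max(axis=1).sum())


# Random hidden states standing in for the Transformer output.
query = embed(rng.normal(size=(4, HIDDEN_SIZE)))     # 4 query tokens
document = embed(rng.normal(size=(9, HIDDEN_SIZE)))  # 9 document tokens

print(query.shape, document.shape)
print(maxsim(query, document))
print(maxsim(query, query))  # every token matches itself, so the score equals the token count
```

Because each query token keeps its own vector, a single distinctive term can still find its best match in the document, which is the property the README contrasts with mean pooling.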

added_tokens.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "[D] ": 30523,
+  "[Q] ": 30522
+}

config.json ADDED

	@@ -0,0 +1,25 @@

+{
+  "_name_or_path": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.48.2",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30524
+}

config_sentence_transformers.json ADDED

	@@ -0,0 +1,49 @@

+{
+  "__version__": {
+    "sentence_transformers": "4.0.2",
+    "transformers": "4.48.2",
+    "pytorch": "2.8.0+cu128"
+  },
+  "prompts": {},
+  "default_prompt_name": null,
+  "similarity_fn_name": "MaxSim",
+  "query_prefix": "[Q] ",
+  "document_prefix": "[D] ",
+  "query_length": 512,
+  "document_length": 512,
+  "attend_to_expansion_tokens": false,
+  "skiplist_words": [
+    "!",
+    "\"",
+    "#",
+    "$",
+    "%",
+    "&",
+    "'",
+    "(",
+    ")",
+    "*",
+    "+",
+    ",",
+    "-",
+    ".",
+    "/",
+    ":",
+    ";",
+    "<",
+    "=",
+    ">",
+    "?",
+    "@",
+    "[",
+    "\\",
+    "]",
+    "^",
+    "_",
+    "`",
+    "{",
+    "|",
+    "}",
+    "~"
+  ]
+}
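
This config sets the `[Q] ` and `[D] ` prefixes and a 32-entry skiplist, which matches Python's `string.punctuation`. The sketch below illustrates how such settings are typically applied in ColBERT-style models: the prefix marks the input type, and skiplist entries are masked out of document scoring. The helper names are hypothetical, and it works on plain string tokens rather than token ids, so PyLate's own logic may differ.

```python
import string

# Values copied from config_sentence_transformers.json above.
QUERY_PREFIX = "[Q] "
DOCUMENT_PREFIX = "[D] "
SKIPLIST = set(string.punctuation)  # same 32 characters as "skiplist_words"


def add_prefix(text: str, is_query: bool) -> str:
    """Hypothetical helper: mark the input as a query or a document."""
    return (QUERY_PREFIX if is_query else DOCUMENT_PREFIX) + text


def skiplist_mask(tokens: list[str]) -> list[bool]:
    """Hypothetical helper: True for tokens kept in scoring, False for skiplist entries."""
    return [token not in SKIPLIST for token in tokens]


print(add_prefix("does aspirin reduce fever", is_query=True))
print(skiplist_mask(["aspirin", ",", "500", "mg", "."]))
```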

eval/triplet_evaluation_results.csv ADDED

	@@ -0,0 +1,88 @@

+epoch,steps,accuracy
+0.13025921583952066,2000,0.9988450407981873
+0.2605184316790413,4000,0.999037504196167
+0.3907776475185619,6000,0.9988450407981873
+0.5210368633580826,8000,0.9994224905967712
+0.6512960791976032,10000,0.9996150135993958
+0.7815552950371238,12000,0.9992300271987915
+0.9118145108766446,14000,0.9992300271987915
+0.09512937595129375,2000,0.9996359348297119
+0.1902587519025875,4000,1.0
+0.2853881278538813,6000,1.0
+0.380517503805175,8000,1.0
+0.4756468797564688,10000,0.999817967414856
+0.5707762557077626,12000,0.999817967414856
+0.6659056316590564,14000,0.999817967414856
+0.76103500761035,16000,0.999817967414856
+0.8561643835616438,18000,0.999817967414856
+0.9512937595129376,20000,0.999817967414856
+0.09512937595129375,2000,0.999089777469635
+0.1902587519025875,4000,0.9994539022445679
+0.2853881278538813,6000,0.9994539022445679
+0.380517503805175,8000,0.9994539022445679
+0.4756468797564688,10000,0.9994539022445679
+0.5707762557077626,12000,0.9994539022445679
+0.6659056316590564,14000,0.9994539022445679
+0.76103500761035,16000,0.9996359348297119
+0.8561643835616438,18000,0.9996359348297119
+0.9512937595129376,20000,0.9996359348297119
+0.09512937595129375,2000,0.9983615875244141
+0.1902587519025875,4000,0.9987256526947021
+0.2853881278538813,6000,0.9987256526947021
+0.380517503805175,8000,0.9996359348297119
+0.4756468797564688,10000,0.9996359348297119
+0.5707762557077626,12000,0.999817967414856
+0.6659056316590564,14000,0.9996359348297119
+0.76103500761035,16000,1.0
+0.8561643835616438,18000,0.999817967414856
+0.9512937595129376,20000,0.999817967414856
+0.09512937595129375,2000,0.99817955493927
+0.1902587519025875,4000,0.998907744884491
+0.2853881278538813,6000,0.998907744884491
+0.380517503805175,8000,0.9985436201095581
+0.4756468797564688,10000,0.9983615875244141
+0.5707762557077626,12000,0.9983615875244141
+0.6659056316590564,14000,0.99817955493927
+0.76103500761035,16000,0.9985436201095581
+0.8561643835616438,18000,0.9983615875244141
+0.9512937595129376,20000,0.9983615875244141
+0.09512937595129375,2000,0.998907744884491
+0.1902587519025875,4000,0.999271810054779
+0.2853881278538813,6000,0.999089777469635
+0.380517503805175,8000,0.999089777469635
+0.4756468797564688,10000,0.9994539022445679
+0.5707762557077626,12000,0.999271810054779
+0.6659056316590564,14000,0.999271810054779
+0.76103500761035,16000,0.999271810054779
+0.8561643835616438,18000,0.999089777469635
+0.9512937595129376,20000,0.999089777469635
+0.09905403397553365,2000,0.998598575592041
+0.1981080679510673,4000,0.9989989995956421
+0.297162101926601,6000,0.9991992115974426
+0.3962161359021346,8000,0.9989989995956421
+0.49527016987766825,10000,0.9991992115974426
+0.594324203853202,12000,0.9989989995956421
+0.6933782378287355,14000,0.9991992115974426
+0.7924322718042692,16000,0.9989989995956421
+0.8914863057798029,18000,0.9989989995956421
+0.9905403397553365,20000,0.9989989995956421
+0.09512937595129375,2000,0.9983615875244141
+0.1902587519025875,4000,0.9985436201095581
+0.2853881278538813,6000,0.9987256526947021
+0.380517503805175,8000,0.9987256526947021
+0.4756468797564688,10000,0.999089777469635
+0.5707762557077626,12000,0.999271810054779
+0.6659056316590564,14000,0.9994539022445679
+0.76103500761035,16000,0.998907744884491
+0.8561643835616438,18000,0.9994539022445679
+0.9512937595129376,20000,0.999271810054779
+0.09512937595129375,2000,0.999089777469635
+0.1902587519025875,4000,0.999089777469635
+0.2853881278538813,6000,0.999271810054779
+0.380517503805175,8000,0.999271810054779
+0.4756468797564688,10000,0.9994539022445679
+0.5707762557077626,12000,0.999817967414856
+0.6659056316590564,14000,0.9996359348297119
+0.76103500761035,16000,0.999817967414856
+0.8561643835616438,18000,0.999817967414856
+0.9512937595129376,20000,0.9996359348297119
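
The step counter in this CSV resets to 2000 several times, which suggests that results from several training runs were appended to one file (an assumption, since the commit does not say so). A minimal sketch for splitting the rows back into segments, using a few rows from the file with intermediate rows omitted:

```python
import csv
import io

# A few rows copied from the CSV above (intermediate rows omitted).
SAMPLE = """epoch,steps,accuracy
0.7815552950371238,12000,0.9992300271987915
0.9118145108766446,14000,0.9992300271987915
0.09512937595129375,2000,0.9996359348297119
0.1902587519025875,4000,1.0
0.9512937595129376,20000,0.999817967414856
0.09512937595129375,2000,0.999089777469635
0.9512937595129376,20000,0.9996359348297119
"""


def split_runs(rows):
    """Start a new run whenever the step counter stops increasing."""
    runs = []
    previous_steps = None
    for row in rows:
        steps = int(row["steps"])
        if previous_steps is None or steps <= previous_steps:
            runs.append([])
        runs[-1].append((steps, float(row["accuracy"])))
        previous_steps = steps
    return runs


runs = split_runs(csv.DictReader(io.StringIO(SAMPLE)))
for index, run in enumerate(runs, start=1):
    last_steps, last_accuracy = run[-1]
    best_accuracy = max(accuracy for _, accuracy in run)
    print(f"run {index}: last step {last_steps}, last {last_accuracy:.6f}, best {best_accuracy:.6f}")
```

Applied to the full file, the same rule should yield nine segments, which lines up with the nine rows in the top-level `triplet_evaluation_results.csv`.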

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a2a46d33500d2dda112f63014cec8d3314497b6fe6e378e23bc21b754d6a3c57
+size 437957472
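
The reported size can be cross-checked against `config.json`. The sketch below estimates the parameter count, assuming the standard BERT layout (including the pooler) and float32 weights. Both are assumptions about how the checkpoint was saved.

```python
# Rough parameter count for the BertModel described in config.json.
# Assumptions: standard BERT layout including the pooler, float32 weights.
VOCAB_SIZE = 30524
HIDDEN = 768
INTERMEDIATE = 3072
LAYERS = 12
MAX_POSITIONS = 512
TYPE_VOCAB = 2
LFS_SIZE = 437957472  # "size" field of the pointer file above


def linear(inputs: int, outputs: int) -> int:
    """Weights plus bias of a dense layer."""
    return inputs * outputs + outputs


layer_norm = 2 * HIDDEN  # scale and shift
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm  # query, key, value, output
feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm
pooler = linear(HIDDEN, HIDDEN)

parameters = embeddings + LAYERS * (attention + feed_forward) + pooler
weight_bytes = parameters * 4

print(f"{parameters:,} parameters, {weight_bytes:,} bytes of weights")
print(f"{LFS_SIZE - weight_bytes:,} bytes remaining")
```

Under these assumptions the remainder is small relative to the file, which is plausible for a header describing a couple of hundred tensors.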

modules.json ADDED

	@@ -0,0 +1,14 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Dense",
+    "type": "pylate.models.Dense.Dense"
+  }
+]
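
`modules.json` lists the pipeline stages in order, each with the folder that holds its files. The sketch below shows how the file can be read to recover that order. It only parses the JSON and does not import or instantiate the referenced classes.

```python
import json

# Contents of modules.json from this commit, condensed onto fewer lines.
MODULES_JSON = """
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Dense", "type": "pylate.models.Dense.Dense"}
]
"""

# Order the modules by idx; an empty path is shown as the repository root.
modules = sorted(json.loads(MODULES_JSON), key=lambda module: module["idx"])
pipeline = [(module["type"].rsplit(".", 1)[-1], module["path"] or ".") for module in modules]

for class_name, folder in pipeline:
    print(f"{class_name:<12} <- {folder}")
```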

sentence_bert_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "max_seq_length": 512,
+  "do_lower_case": false
+}

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[MASK]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,74 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "30522": {
+      "content": "[Q] ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "30523": {
+      "content": "[D] ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 1000000000000000019884624838656,
+  "never_split": null,
+  "pad_token": "[MASK]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
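
The `model_max_length` value here is not a real limit: it is the integer that `int(1e30)` evaluates to, which acts as a "no limit" placeholder. The practical limit therefore comes from the other config files in this commit. A small sketch of that reasoning, where taking the minimum is an illustrative choice and not a documented library rule:

```python
# Values copied from the config files in this commit.
TOKENIZER_MODEL_MAX_LENGTH = 1000000000000000019884624838656  # tokenizer_config.json
MAX_SEQ_LENGTH = 512           # sentence_bert_config.json
MAX_POSITION_EMBEDDINGS = 512  # config.json

# The tokenizer value equals int(1e30), i.e. a placeholder rather than a limit.
is_placeholder = TOKENIZER_MODEL_MAX_LENGTH == int(1e30)

# Illustrative rule: the tightest of the available limits applies.
effective_limit = min(TOKENIZER_MODEL_MAX_LENGTH, MAX_SEQ_LENGTH, MAX_POSITION_EMBEDDINGS)

print(is_placeholder, effective_limit)
```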

triplet_evaluation_results.csv ADDED

	@@ -0,0 +1,10 @@

+epoch,steps,accuracy
+-1,-1,0.9998083710670471
+-1,-1,0.9999090433120728
+-1,-1,0.9998180866241455
+-1,-1,0.999454140663147
+-1,-1,0.9991812109947205
+-1,-1,0.9993631839752197
+-1,-1,0.999399721622467
+-1,-1,0.9992722272872925
+-1,-1,0.9998180866241455
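
Every row in this file has `epoch` and `steps` set to `-1`, which is assumed here to mark evaluations run outside a training loop. A short summary of the nine accuracy values, using only the standard library:

```python
import statistics

# Accuracy column copied from triplet_evaluation_results.csv above.
accuracies = [
    0.9998083710670471,
    0.9999090433120728,
    0.9998180866241455,
    0.999454140663147,
    0.9991812109947205,
    0.9993631839752197,
    0.999399721622467,
    0.9992722272872925,
    0.9998180866241455,
]

print(f"rows: {len(accuracies)}")
print(f"min:  {min(accuracies):.6f}")
print(f"max:  {max(accuracies):.6f}")
print(f"mean: {statistics.mean(accuracies):.6f}")
```

All nine values sit above 0.999, so the evaluation separates the runs by only a few triplets.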

vocab.txt ADDED

The diff for this file is too large to render. See raw diff