Davidsamuel101 committed on
Commit cb41b91 · verified · 1 Parent(s): c489690

Add new CrossEncoder model

Files changed (7)
  1. README.md +313 -0
  2. config.json +36 -0
  3. model.safetensors +3 -0
  4. special_tokens_map.json +37 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +58 -0
  7. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,313 @@
+ ---
+ tags:
+ - sentence-transformers
+ - cross-encoder
+ - generated_from_trainer
+ - dataset_size:5
+ - loss:MultipleNegativesRankingLoss
+ base_model: cross-encoder/ms-marco-MiniLM-L12-v2
+ pipeline_tag: text-ranking
+ library_name: sentence-transformers
+ ---
+
+ # CrossEncoder based on cross-encoder/ms-marco-MiniLM-L12-v2
+
+ This is a [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) model finetuned from [cross-encoder/ms-marco-MiniLM-L12-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L12-v2) using the [sentence-transformers](https://www.SBERT.net) library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Cross Encoder
+ - **Base model:** [cross-encoder/ms-marco-MiniLM-L12-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L12-v2) <!-- at revision a34da8fab3ad458d48778dea3276ce729857efaf -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Number of Output Labels:** 1 label
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Documentation:** [Cross Encoder Documentation](https://www.sbert.net/docs/cross_encoder/usage/usage.html)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Cross Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=cross-encoder)
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import CrossEncoder
+
+ # Download from the 🤗 Hub
+ model = CrossEncoder("Davidsamuel101/ft-ms-marco-MiniLM-L12-v2-claims-reranker")
+ # Get scores for pairs of texts
+ pairs = [
+     ["It found the scientists' rigour and honesty are not in doubt, and their behaviour did not prejudice the IPCC's conclusions, though they did fail to display the proper degree of openness.", 'The report, issued on 18 February 2011, cleared the researchers and "did not find any evidence that NOAA inappropriately manipulated data or failed to adhere to appropriate peer review procedures".'],
+     ["It found the scientists' rigour and honesty are not in doubt, and their behaviour did not prejudice the IPCC's conclusions, though they did fail to display the proper degree of openness.", 'Ongoing experiments are conducted by more than 4,000 scientists from many nations.'],
+     ["It found the scientists' rigour and honesty are not in doubt, and their behaviour did not prejudice the IPCC's conclusions, though they did fail to display the proper degree of openness.", 'Novell did not seem to proceed to a full court case after losing their case there.'],
+     ["It found the scientists' rigour and honesty are not in doubt, and their behaviour did not prejudice the IPCC's conclusions, though they did fail to display the proper degree of openness.", 'In the face of determined opposition from the National Park Service and conservation groups, the dam was never built.'],
+     ["It found the scientists' rigour and honesty are not in doubt, and their behaviour did not prejudice the IPCC's conclusions, though they did fail to display the proper degree of openness.", 'At Caltech he developed the first instrument able to measure carbon dioxide in atmospheric samples with consistently reliable accuracy.'],
+ ]
+ scores = model.predict(pairs)
+ print(scores.shape)
+ # (5,)
+
+ # Or rank different texts based on similarity to a single text
+ ranks = model.rank(
+     "It found the scientists' rigour and honesty are not in doubt, and their behaviour did not prejudice the IPCC's conclusions, though they did fail to display the proper degree of openness.",
+     [
+         'The report, issued on 18 February 2011, cleared the researchers and "did not find any evidence that NOAA inappropriately manipulated data or failed to adhere to appropriate peer review procedures".',
+         'Ongoing experiments are conducted by more than 4,000 scientists from many nations.',
+         'Novell did not seem to proceed to a full court case after losing their case there.',
+         'In the face of determined opposition from the National Park Service and conservation groups, the dam was never built.',
+         'At Caltech he developed the first instrument able to measure carbon dioxide in atmospheric samples with consistently reliable accuracy.',
+     ]
+ )
+ # [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
+ ```
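+
+ Because the checkpoint is a standard `BertForSequenceClassification` model with a single output label (see `config.json`), pairs can also be scored with plain 🤗 Transformers. The snippet below is a rough sketch, not the canonical usage: the trailing sigmoid is an assumption that mirrors the `activation_fn` recorded in the model's configuration, and you may prefer the raw logits.
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ model_name = "Davidsamuel101/ft-ms-marco-MiniLM-L12-v2-claims-reranker"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+ model.eval()
+
+ query = "It found the scientists' rigour and honesty are not in doubt, and their behaviour did not prejudice the IPCC's conclusions, though they did fail to display the proper degree of openness."
+ candidates = [
+     'The report, issued on 18 February 2011, cleared the researchers and "did not find any evidence that NOAA inappropriately manipulated data or failed to adhere to appropriate peer review procedures".',
+     'Ongoing experiments are conducted by more than 4,000 scientists from many nations.',
+ ]
+
+ # Tokenize each (query, candidate) pair and score them in one forward pass.
+ features = tokenizer([query] * len(candidates), candidates,
+                      padding=True, truncation=True, max_length=512, return_tensors="pt")
+ with torch.no_grad():
+     logits = model(**features).logits.squeeze(-1)  # one raw score per pair
+     scores = torch.sigmoid(logits)                 # assumed: squash scores into (0, 1)
+ print(scores)
+ ```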
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 5 training samples
+ * Columns: <code>text1</code>, <code>text2</code>, and <code>label</code>
+ * Approximate statistics based on the first 5 samples:
+   |         | text1  | text2  | label |
+   |:--------|:-------|:-------|:------|
+   | type    | string | string | int   |
+   | details | <ul><li>min: 186 characters</li><li>mean: 186.0 characters</li><li>max: 186 characters</li></ul> | <ul><li>min: 82 characters</li><li>mean: 122.6 characters</li><li>max: 197 characters</li></ul> | <ul><li>0: ~80.00%</li><li>1: ~20.00%</li></ul> |
+ * Samples:
+   | text1 | text2 | label |
+   |:------|:------|:------|
+   | <code>It found the scientists' rigour and honesty are not in doubt, and their behaviour did not prejudice the IPCC's conclusions, though they did fail to display the proper degree of openness.</code> | <code>The report, issued on 18 February 2011, cleared the researchers and "did not find any evidence that NOAA inappropriately manipulated data or failed to adhere to appropriate peer review procedures".</code> | <code>1</code> |
+   | <code>It found the scientists' rigour and honesty are not in doubt, and their behaviour did not prejudice the IPCC's conclusions, though they did fail to display the proper degree of openness.</code> | <code>Ongoing experiments are conducted by more than 4,000 scientists from many nations.</code> | <code>0</code> |
+   | <code>It found the scientists' rigour and honesty are not in doubt, and their behaviour did not prejudice the IPCC's conclusions, though they did fail to display the proper degree of openness.</code> | <code>Novell did not seem to proceed to a full court case after losing their case there.</code> | <code>0</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 10.0,
+       "num_negatives": 4,
+       "activation_fn": "torch.nn.modules.activation.Sigmoid"
+   }
+   ```
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 16
+ - `learning_rate`: 1e-05
+ - `num_train_epochs`: 10
+ - `bf16`: True
+ - `load_best_model_at_end`: True
+
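+ Taken together, the loss parameters and the non-default hyperparameters above correspond roughly to the following sentence-transformers v4 cross-encoder training setup. This is a sketch under stated assumptions (the `CrossEncoderTrainer` API and the loss keyword names are inferred from the parameter listings in this card), not a verbatim copy of the original training script.
+
+ ```python
+ import torch
+ from sentence_transformers.cross_encoder import (
+     CrossEncoder,
+     CrossEncoderTrainer,
+     CrossEncoderTrainingArguments,
+ )
+ from sentence_transformers.cross_encoder.losses import MultipleNegativesRankingLoss
+
+ # Start from the base model named in this card.
+ model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L12-v2", num_labels=1)
+
+ # Loss configuration mirrors the JSON block in the Training Dataset section.
+ loss = MultipleNegativesRankingLoss(
+     model=model,
+     scale=10.0,
+     num_negatives=4,
+     activation_fn=torch.nn.Sigmoid(),
+ )
+
+ # Non-default hyperparameters listed above.
+ args = CrossEncoderTrainingArguments(
+     output_dir="ft-ms-marco-MiniLM-L12-v2-claims-reranker",
+     num_train_epochs=10,
+     per_device_train_batch_size=16,
+     learning_rate=1e-5,
+     bf16=True,
+     eval_strategy="steps",
+     load_best_model_at_end=True,  # in practice this also needs an eval dataset and metric
+ )
+
+ # With a dataset of (text1, text2, label) rows like the samples above:
+ # trainer = CrossEncoderTrainer(model=model, args=args,
+ #                               train_dataset=train_dataset, loss=loss)
+ # trainer.train()
+ ```
+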
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 8
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 1e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 10
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: True
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: True
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `tp_size`: 0
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: proportional
+
+ </details>
+
+ ### Framework Versions
+ - Python: 3.13.2
+ - Sentence Transformers: 4.1.0
+ - Transformers: 4.51.3
+ - PyTorch: 2.7.0+cu128
+ - Accelerate: 1.6.0
+ - Datasets: 3.6.0
+ - Tokenizers: 0.21.1
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 384,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 1536,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "sbert_ce_default_activation_function": "torch.nn.modules.linear.Identity",
+   "sentence_transformers": {
+     "activation_fn": "torch.nn.modules.activation.Sigmoid",
+     "version": "4.1.0"
+   },
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:53ea43205a18064214a0d2fc95d1b8c72cf070ec3980bbec152cb772ceef9e9a
+ size 133464836
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff