BlackBeenie committed
Commit 771b80f · verified · 1 Parent(s): 4430241

Add new SentenceTransformer model
README.md ADDED
@@ -0,0 +1,506 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:499184
+ - loss:MultipleNegativesRankingLoss
+ base_model: jxm/cde-small-v2
+ widget:
+ - source_sentence: Heterozygous Advantage Definition
+   sentences:
+   - A heterozygote advantage (heterozygous advantage) describes the case in which
+     the heterozygote genotype has a higher relative fitness than either the homozygote
+     dominant or homozygote recessive genotype.
+   - Science Main Index. Animals with an internal skeleton made of bone are called
+     vertebrates. Vertebrates include fish, amphibians, reptiles, birds, mammals, primates,
+     rodents and marsupials. Although vertebrates represent only a very small percentage
+     of all animals, their size and mobility often allow them to dominate their environment.
+   - 'By Regina Bailey. Definition: Heterozygous refers to having two different alleles
+     for a single trait. Related Terms: Allele, Genes, Homozygous. Examples: The gene
+     for seed shape in pea plants exists in two forms, one form or allele for round
+     seed shape (R) and the other for wrinkled seed shape (r). heterozygous plant would
+     contain the following alleles for seed shape: (Rr). Organisms have two alleles
+     for each trait. When the alleles of a pair are heterozygous, one is dominant and
+     the other is recessive. Using the previous example, round seed shape (R) is dominant
+     and wrinkled seed shape (r) is recessive.'
+ - source_sentence: definition of annul
+   sentences:
+   - "When a celebrity wakes up in Las Vegas with a mysterious wedding ring on her\
+     \ finger, the first thing she’ll probably want to do is annul the marriage.\
+     \ That will declare it invalid and officially cancel the whole deal. Annul, which\
+     \ means “to cancel” or “to invalidate,” is usually\
+     \ used in the context of politics or marriage. New government officials often\
+     \ want to annul laws and policies of the previous post-holder, effectively reversing\
+     \ their work. When you annul a marriage, you are officially declaring it invalid,\
+     \ as if it never happened."
+   - 'The proper term for Catholic annulment is declaration of nullity: the Church
+     declares that the marriage never was valid in the first place. This becomes clearer
+     when we compare Catholic annulment to civil divorce. A divorce is effective as
+     of the date of the divorce decree.Before that, the couple was still married.nnulment
+     for an invalid marriage Catholic annulment means that a couple was never married
+     in the sacramental sense. God did not create that unbreakable bond between them
+     because the sacrament of marriage was not actually fulfilled. The term annulment
+     is actually a little misleading.'
+   - Another word for consistent word list. Below are a number of words whose meaning
+     is similar to consistent. 1 accordant. 2 compatible. 3 conformable. 4 congruous.
+     5 harmonious. 6 suitable. 7 uniform.
+ - source_sentence: how much do peds nurse make
+   sentences:
+   - Vyvanse is detectable in urine up to 3 days after ingesting Vyvanse. Vyvanse is
+     detectable in hair samples for months after ingestion. Though Vyvanse itself only
+     stays in your system four hours post-ingestion, the active drug d-amphetamine
+     stays in your system for 40 hours.
+   - A newly practicing pediatric nurse in the US receives a beginning yearly salary
+     of around $31,311 but as he/she gains experience, he/she can anticipate a yearly
+     income of up to $81,840. The national hourly rate for Pediatric Nurse is from
+     between $15.53 to $35.81 with an average overtime pay of $6.93 to $54.59 per hour.
+   - 'Rad Tech Salary: $64,450 a year. Average pay for rad techs is $64,450 per annum,
+     which is 35% higher than the US median income. A radiographer makes an average
+     of $5,371 per month; $1,239 a week and $30.99 an hour. radiology technologist
+     can make more than $87,160 a year depending on many factors like work place, education,
+     experience, performance, etc. Working at schools ($74,810) or specialty hospitals
+     ($72,410) would help you make more money than other industries. Massachusetts
+     is one of the best state based on annual income.'
+ - source_sentence: cost of six sigma certification
+   sentences:
+   - "The Roosevelt Corollary was an addition to the Monroe Doctrine which stated that\
+     \ no European countries were allowed to intervene with Latin American affairs.\
+     \ The only way that … the U.S was allowed to become involved was if the affairs\
+     \ or European countries was threatened."
+   - 1 The cost of the certification exams varies per training center, so you still
+     need to contact the center nearest you to get the actual price. 2 However, if
+     we look at the centers that have published their exam rates, we found that the
+     average cost of the exam is between $130 and $170. The costs of these training
+     programs could cost anywehre from $1,500 to more than $2,500. 2 For example,
+     a training course for AutoCAD being offered by Delta.edu costs $2,595.
+   - You can buy this ExpertRating Online Six Sigma Green Belt Certification. leading
+     to Certification at a special offer price of only $99.99 which includes the in-depth
+     ExpertRating Online Six Sigma Green Belt Courseware and exam fee. The ExpertRating
+     Six Sigma Green Belt Certification is by far the best value for money Six Sigma
+     Green Belt Certification at $99.99. Worldwide airmail delivery of the hard copy
+     Six Sigma Green Belt certificate. The certificate can be used to prove your certified
+     status and does not mention the word online.
+ - source_sentence: when did jeepers creepers come out
+   sentences:
+   - Jeepers Creepers Wiki. Creeper. Creeper is a fictional character and the main
+     antagonist in the 2001 horror film Jeepers Creepers and its 2003 sequel Jeepers
+     Creepers II. It is an ancient, mysterious demon who viciously feeds on the flesh
+     and bones of many human beings for 23 days every 23rd spring.
+   - Moline, IL,sales tax rate is 7.25%, and the Income tax is 8.92%.
+   - ' Creep is a song by the English alternative rock band Radiohead. Radiohead released
+     Creep as their debut single in 1992, and it later appeared on their first album,
+     Pablo Honey (1993). During its initial release, Creep was not a chart success.'
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ ---
+
+ # SentenceTransformer based on jxm/cde-small-v2
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [jxm/cde-small-v2](https://huggingface.co/jxm/cde-small-v2). It maps sentences & paragraphs to a 768-dimensional dense vector space (the hidden size of the underlying answerdotai/ModernBERT-base embedder) and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [jxm/cde-small-v2](https://huggingface.co/jxm/cde-small-v2) <!-- at revision 287bf0ea6ebfecf2339762d0ef28fb846959a8f2 -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({}) with Transformer model: ContextualDocumentEmbeddingTransformer
+ )
+ ```
+
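+ The single module wraps the two-stage contextual (CDE) model: as `modules.json` and the bundled `sentence_transformers_impl.py` below show, it accepts an optional `dataset_embeddings` keyword at encode time and falls back to the first-stage encoder when that keyword is omitted.
+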
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference. Because the repository ships custom model code, `trust_remote_code=True` is required (the bundled `sentence_transformers_impl.py` refuses to load without it).
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub (trust_remote_code is required by the custom CDE module)
+ model = SentenceTransformer("BlackBeenie/cde-small-v2-biencoder-msmarco", trust_remote_code=True)
+ # Run inference
+ sentences = [
+     'when did jeepers creepers come out',
+     'Jeepers Creepers Wiki. Creeper. Creeper is a fictional character and the main antagonist in the 2001 horror film Jeepers Creepers and its 2003 sequel Jeepers Creepers II. It is an ancient, mysterious demon who viciously feeds on the flesh and bones of many human beings for 23 days every 23rd spring.',
+     ' Creep is a song by the English alternative rock band Radiohead. Radiohead released Creep as their debut single in 1992, and it later appeared on their first album, Pablo Honey (1993). During its initial release, Creep was not a chart success.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 768]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
+
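+ Because this is a contextual (CDE) model, retrieval quality improves when the second stage is conditioned on a small sample of your corpus. A minimal sketch of that two-stage pattern, following the usage documented for the cde-small base models: the `query`/`document` prompt names come from `config_sentence_transformers.json`, the minicorpus size of 512 from `transductive_corpus_size` in `config.json`, and `corpus`/`queries` are placeholders:
+
+ ```python
+ import random
+
+ corpus = ["..."]   # your documents (placeholder)
+ queries = ["..."]  # your search queries (placeholder)
+
+ # Stage 1: embed a random minicorpus that conditions the second stage.
+ minicorpus = random.sample(corpus, k=min(512, len(corpus)))
+ dataset_embeddings = model.encode(minicorpus, prompt_name="document", convert_to_tensor=True)
+
+ # Stage 2: contextual embeddings, conditioned on the minicorpus.
+ doc_embeddings = model.encode(corpus, prompt_name="document", dataset_embeddings=dataset_embeddings)
+ query_embeddings = model.encode(queries, prompt_name="query", dataset_embeddings=dataset_embeddings)
+ print(model.similarity(query_embeddings, doc_embeddings))
+ ```
+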
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 499,184 training samples
+ * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>sentence_2</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence_0 | sentence_1 | sentence_2 |
+   |:--------|:-----------|:-----------|:-----------|
+   | type    | string | string | string |
+   | details | <ul><li>min: 4 tokens</li><li>mean: 9.26 tokens</li><li>max: 29 tokens</li></ul> | <ul><li>min: 14 tokens</li><li>mean: 81.55 tokens</li><li>max: 203 tokens</li></ul> | <ul><li>min: 16 tokens</li><li>mean: 80.95 tokens</li><li>max: 231 tokens</li></ul> |
+ * Samples:
+   | sentence_0 | sentence_1 | sentence_2 |
+   |:-----------|:-----------|:-----------|
+   | <code>what year did the sandy hook incident happen</code> | <code>For Newtown, 2012 Sandy Hook Elementary School shooting is still painful. It's been three years since the terrible day Jimmy Greene’s 6-year-old daughter, Ana Grace Marquez, and 19 other children were murdered in the mass shooting at Sandy Hook Elementary School. But life without Ana, who loved to sing and dance from room to room, continues to be so hard that, in some ways, Dec. 14 is no tougher than any other day for Greene.</code> | <code>Hook is a 1991 Steven Spielberg film starring Dustin Hoffman and Robin Williams. The film's storyline is based on the books written by Sir James Matthew Barrie in 1904 or 1905 and is the sequel to the first book.</code> |
+   | <code>what kind of degree do you need to be a medical assistant?</code> | <code>If you choose this path, here is what you need to do: 1 Have a high school diploma or GED. The minimum educational requirement for medical assistants is a high school diploma or equivalency degree. 2 Find a doctor who will provide training.</code> | <code>Many colleges offer two-year associate's degrees or one-year certificate programs in different areas of medical office technology. Certificate areas include billing specialist, medical administrative assistant, and medical transcriptionist. Because of the complexity of medical jargon and operational procedures, many employers prefer these professionals to hold related two-year degrees or complete one-year training programs.</code> |
+   | <code>what does usb cord do</code> | <code>The Flash Player is required to see this video. The term USB stands for Universal Serial Bus. USB cable assemblies are some of the most popular cable types available, used mostly to connect computers to peripheral devices such as cameras, camcorders, printers, scanners, and more. Devices manufactured to the current USB Revision 3.0 specification are backward compatible with version 1.1.</code> | <code>The USB 2.0 specification for a Full-Speed/High-Speed cable calls for four wires, two for data and two for power, and a braided outer shield. The USB 3.0 specification calls for a total of 10 wires plus a braided outer shield. Two wires are used for power.</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters (sketched below):
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim"
+   }
+   ```
+
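+ For reference, a minimal sketch of what this loss computes on a batch of (anchor, positive, hard-negative) triplets: every positive and hard negative in the batch serves as a candidate for every anchor, and the scaled cosine-similarity matrix is scored with cross-entropy. This mirrors, rather than reproduces, the library implementation:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def mnr_loss(anchors, positives, negatives, scale: float = 20.0) -> torch.Tensor:
+     """In-batch-negatives sketch of MultipleNegativesRankingLoss (cos_sim, scale=20.0)."""
+     candidates = torch.cat([positives, negatives], dim=0)  # (2B, D): positives first, then hard negatives
+     a = F.normalize(anchors, dim=-1)
+     c = F.normalize(candidates, dim=-1)
+     scores = scale * a @ c.T                               # (B, 2B) scaled cosine similarities
+     labels = torch.arange(a.size(0), device=a.device)      # anchor i should rank candidate i highest
+     return F.cross_entropy(scores, labels)
+ ```
+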
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `per_device_train_batch_size`: 32
+ - `per_device_eval_batch_size`: 32
+ - `fp16`: True
+ - `multi_dataset_batch_sampler`: round_robin
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: no
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 32
+ - `per_device_eval_batch_size`: 32
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 3
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: True
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `tp_size`: 0
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
+
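+ The non-default values above can be approximated with the standard Sentence Transformers v3 training API. A sketch, not the exact training script; the dataset construction and its contents are assumptions based on the columns listed under Training Dataset:
+
+ ```python
+ from datasets import Dataset
+ from sentence_transformers import (
+     SentenceTransformer,
+     SentenceTransformerTrainer,
+     SentenceTransformerTrainingArguments,
+ )
+ from sentence_transformers.losses import MultipleNegativesRankingLoss
+
+ # Hypothetical (query, positive, hard negative) triplets with the columns listed above.
+ train_dataset = Dataset.from_dict({
+     "sentence_0": ["when did jeepers creepers come out"],
+     "sentence_1": ["Jeepers Creepers is a 2001 horror film ..."],
+     "sentence_2": ["Creep is a song by Radiohead ..."],
+ })
+
+ model = SentenceTransformer("jxm/cde-small-v2", trust_remote_code=True)
+ args = SentenceTransformerTrainingArguments(
+     output_dir="cde-small-v2-biencoder-msmarco",
+     num_train_epochs=3,
+     per_device_train_batch_size=32,
+     fp16=True,
+ )
+ trainer = SentenceTransformerTrainer(
+     model=model,
+     args=args,
+     train_dataset=train_dataset,
+     loss=MultipleNegativesRankingLoss(model),
+ )
+ trainer.train()
+ ```
+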
+ ### Training Logs
+ | Epoch | Step | Training Loss |
+ |:------:|:-----:|:-------------:|
+ | 0.0321 | 500 | 0.9856 |
+ | 0.0641 | 1000 | 0.4499 |
+ | 0.0962 | 1500 | 0.3673 |
+ | 0.1282 | 2000 | 0.339 |
+ | 0.1603 | 2500 | 0.3118 |
+ | 0.1923 | 3000 | 0.2929 |
+ | 0.2244 | 3500 | 0.2886 |
+ | 0.2564 | 4000 | 0.2771 |
+ | 0.2885 | 4500 | 0.2762 |
+ | 0.3205 | 5000 | 0.2716 |
+ | 0.3526 | 5500 | 0.2585 |
+ | 0.3846 | 6000 | 0.2631 |
+ | 0.4167 | 6500 | 0.2458 |
+ | 0.4487 | 7000 | 0.2496 |
+ | 0.4808 | 7500 | 0.252 |
+ | 0.5128 | 8000 | 0.2399 |
+ | 0.5449 | 8500 | 0.2422 |
+ | 0.5769 | 9000 | 0.2461 |
+ | 0.6090 | 9500 | 0.2314 |
+ | 0.6410 | 10000 | 0.2331 |
+ | 0.6731 | 10500 | 0.2314 |
+ | 0.7051 | 11000 | 0.2302 |
+ | 0.7372 | 11500 | 0.235 |
+ | 0.7692 | 12000 | 0.2176 |
+ | 0.8013 | 12500 | 0.2201 |
+ | 0.8333 | 13000 | 0.2206 |
+ | 0.8654 | 13500 | 0.222 |
+ | 0.8974 | 14000 | 0.2136 |
+ | 0.9295 | 14500 | 0.2108 |
+ | 0.9615 | 15000 | 0.2102 |
+ | 0.9936 | 15500 | 0.2098 |
+ | 1.0256 | 16000 | 0.1209 |
+ | 1.0577 | 16500 | 0.099 |
+ | 1.0897 | 17000 | 0.0944 |
+ | 1.1218 | 17500 | 0.0955 |
+ | 1.1538 | 18000 | 0.0947 |
+ | 1.1859 | 18500 | 0.0953 |
+ | 1.2179 | 19000 | 0.0943 |
+ | 1.25 | 19500 | 0.0911 |
+ | 1.2821 | 20000 | 0.0964 |
+ | 1.3141 | 20500 | 0.0933 |
+ | 1.3462 | 21000 | 0.0956 |
+ | 1.3782 | 21500 | 0.0941 |
+ | 1.4103 | 22000 | 0.0903 |
+ | 1.4423 | 22500 | 0.0889 |
+ | 1.4744 | 23000 | 0.0919 |
+ | 1.5064 | 23500 | 0.0917 |
+ | 1.5385 | 24000 | 0.0956 |
+ | 1.5705 | 24500 | 0.0903 |
+ | 1.6026 | 25000 | 0.0931 |
+ | 1.6346 | 25500 | 0.0931 |
+ | 1.6667 | 26000 | 0.089 |
+ | 1.6987 | 26500 | 0.0892 |
+ | 1.7308 | 27000 | 0.091 |
+ | 1.7628 | 27500 | 0.0892 |
+ | 1.7949 | 28000 | 0.0884 |
+ | 1.8269 | 28500 | 0.0889 |
+ | 1.8590 | 29000 | 0.0877 |
+ | 1.8910 | 29500 | 0.0866 |
+ | 1.9231 | 30000 | 0.0853 |
+ | 1.9551 | 30500 | 0.085 |
+ | 1.9872 | 31000 | 0.0867 |
+ | 2.0192 | 31500 | 0.055 |
+ | 2.0513 | 32000 | 0.0338 |
+ | 2.0833 | 32500 | 0.033 |
+ | 2.1154 | 33000 | 0.033 |
+ | 2.1474 | 33500 | 0.0317 |
+ | 2.1795 | 34000 | 0.0323 |
+ | 2.2115 | 34500 | 0.0322 |
+ | 2.2436 | 35000 | 0.0316 |
+ | 2.2756 | 35500 | 0.0314 |
+ | 2.3077 | 36000 | 0.0312 |
+ | 2.3397 | 36500 | 0.0324 |
+ | 2.3718 | 37000 | 0.0324 |
+ | 2.4038 | 37500 | 0.0328 |
+ | 2.4359 | 38000 | 0.0311 |
+ | 2.4679 | 38500 | 0.0312 |
+ | 2.5 | 39000 | 0.0312 |
+ | 2.5321 | 39500 | 0.0311 |
+ | 2.5641 | 40000 | 0.0315 |
+ | 2.5962 | 40500 | 0.0308 |
+ | 2.6282 | 41000 | 0.0308 |
+ | 2.6603 | 41500 | 0.0306 |
+ | 2.6923 | 42000 | 0.0313 |
+ | 2.7244 | 42500 | 0.0322 |
+ | 2.7564 | 43000 | 0.0315 |
+ | 2.7885 | 43500 | 0.0311 |
+ | 2.8205 | 44000 | 0.0321 |
+ | 2.8526 | 44500 | 0.0318 |
+ | 2.8846 | 45000 | 0.0305 |
+ | 2.9167 | 45500 | 0.031 |
+ | 2.9487 | 46000 | 0.032 |
+ | 2.9808 | 46500 | 0.0306 |
+
+
+ ### Framework Versions
+ - Python: 3.11.12
+ - Sentence Transformers: 3.4.1
+ - Transformers: 4.50.3
+ - PyTorch: 2.6.0+cu124
+ - Accelerate: 1.5.2
+ - Datasets: 3.5.0
+ - Tokenizers: 0.21.1
+
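+ To compare your environment against these versions (a convenience check, not a requirement; the CUDA build of PyTorch will vary by platform):
+
+ ```python
+ import sentence_transformers, torch, transformers
+
+ print(sentence_transformers.__version__)  # expected: 3.4.1
+ print(transformers.__version__)           # expected: 4.50.3
+ print(torch.__version__)                  # expected: 2.6.0+cu124
+ ```
+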
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "architecture": "transductive",
+   "architectures": [
+     "ContextualDocumentEmbeddingTransformer"
+   ],
+   "attn_implementation": null,
+   "auto_map": {
+     "AutoConfig": "jxm/cde-small-v2--model.ContextualModelConfig",
+     "AutoModel": "jxm/cde-small-v2--model.ContextualDocumentEmbeddingTransformer"
+   },
+   "autoregressive_backbone": false,
+   "cache_dir": null,
+   "config_name": null,
+   "dataset_backbone": null,
+   "disable_dropout": true,
+   "disable_transductive_rotary_embedding": true,
+   "embedder": "answerdotai/ModernBERT-base",
+   "embedder_rerank": "sentence-transformers/gtr-t5-base",
+   "embedding_output_dim": null,
+   "limit_layers": null,
+   "limit_layers_first_stage": null,
+   "logit_scale": 50.0,
+   "max_seq_length": 512,
+   "model_revision": "main",
+   "pool_ignore_contextual_tokens": true,
+   "pool_ignore_instruction_tokens": true,
+   "pooling_strategy": "mean",
+   "tokenizer_name": null,
+   "torch_dtype": "float32",
+   "transductive_corpus_size": 512,
+   "transductive_sequence_dropout_prob": 0.0,
+   "transductive_tie_token_embeddings": false,
+   "transductive_tokens_per_document": 1,
+   "transformers_version": "4.50.3"
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.4.1",
+     "transformers": "4.50.3",
+     "pytorch": "2.6.0+cu124"
+   },
+   "prompts": {
+     "query": "search_query: ",
+     "document": "search_document: "
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c2d35de1ac1d398c844b1867693e9f13381422029d5318192237bc1fc291f5d3
+ size 1222859872
modules.json ADDED
@@ -0,0 +1,11 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers_impl.Transformer",
+     "kwargs": [
+       "dataset_embeddings"
+     ]
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1 @@
+ {}
sentence_transformers_impl.py ADDED
@@ -0,0 +1,155 @@
+ from __future__ import annotations
+
+ import json
+ import logging
+ import os
+ from typing import Any, Optional
+
+ import torch
+ from torch import nn
+ from transformers import AutoConfig, AutoModel, AutoTokenizer
+
+ logger = logging.getLogger(__name__)
+
+
+ class Transformer(nn.Module):
+     """Hugging Face AutoModel to generate token embeddings.
+     Loads the correct class, e.g. BERT / RoBERTa etc.
+
+     Args:
+         model_name_or_path: Hugging Face models name
+             (https://huggingface.co/models)
+         max_seq_length: Truncate any inputs longer than max_seq_length
+         model_args: Keyword arguments passed to the Hugging Face
+             Transformers model
+         tokenizer_args: Keyword arguments passed to the Hugging Face
+             Transformers tokenizer
+         config_args: Keyword arguments passed to the Hugging Face
+             Transformers config
+         cache_dir: Cache dir for Hugging Face Transformers to store/load
+             models
+         do_lower_case: If true, lowercases the input (independent if the
+             model is cased or not)
+         tokenizer_name_or_path: Name or path of the tokenizer. When
+             None, then model_name_or_path is used
+         backend: Backend used for model inference. Can be `torch`, `onnx`,
+             or `openvino`. Default is `torch`.
+     """
+
+     save_in_root: bool = True
+
+     def __init__(
+         self,
+         model_name_or_path: str,
+         model_args: dict[str, Any] | None = None,
+         tokenizer_args: dict[str, Any] | None = None,
+         config_args: dict[str, Any] | None = None,
+         cache_dir: str | None = None,
+         **kwargs,
+     ) -> None:
+         super().__init__()
+         if model_args is None:
+             model_args = {}
+         if tokenizer_args is None:
+             tokenizer_args = {}
+         if config_args is None:
+             config_args = {}
+
+         # The CDE model ships custom code, so loading must be opted into explicitly.
+         if not model_args.get("trust_remote_code", False):
+             raise ValueError(
+                 "You need to set `trust_remote_code=True` to load this model."
+             )
+
+         self.config = AutoConfig.from_pretrained(model_name_or_path, **config_args, cache_dir=cache_dir)
+         self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=self.config, cache_dir=cache_dir, **model_args)
+
+         # The tokenizer is pinned to the ModernBERT-base backbone that cde-small-v2
+         # embeds with, rather than resolved from model_name_or_path.
+         self.tokenizer = AutoTokenizer.from_pretrained(
+             "answerdotai/ModernBERT-base",
+             cache_dir=cache_dir,
+             **tokenizer_args,
+         )
+
+     def __repr__(self) -> str:
+         return f"Transformer({self.get_config_dict()}) with Transformer model: {self.auto_model.__class__.__name__} "
+
+     def forward(self, features: dict[str, torch.Tensor], dataset_embeddings: Optional[torch.Tensor] = None, **kwargs) -> dict[str, torch.Tensor]:
+         """Adds the sentence embedding computed by the first- or second-stage CDE model to `features`."""
+         # If we don't have embeddings, then run the 1st stage model.
+         # If we do, then run the 2nd stage model.
+         if dataset_embeddings is None:
+             sentence_embedding = self.auto_model.first_stage_model(
+                 input_ids=features["input_ids"],
+                 attention_mask=features["attention_mask"],
+             )
+         else:
+             sentence_embedding = self.auto_model.second_stage_model(
+                 input_ids=features["input_ids"],
+                 attention_mask=features["attention_mask"],
+                 dataset_embeddings=dataset_embeddings,
+             )
+
+         features["sentence_embedding"] = sentence_embedding
+         return features
+
+     def get_word_embedding_dimension(self) -> int:
+         return self.auto_model.config.hidden_size
+
+     def tokenize(
+         self, texts: list[str] | list[dict] | list[tuple[str, str]], padding: str | bool = True
+     ) -> dict[str, torch.Tensor]:
+         """Tokenizes a text and maps tokens to token-ids"""
+         output = {}
+         if isinstance(texts[0], str):
+             to_tokenize = [texts]
+         elif isinstance(texts[0], dict):
+             # Inputs given as {key: text} mappings: remember the key order alongside the texts.
+             to_tokenize = []
+             output["text_keys"] = []
+             for lookup in texts:
+                 text_key, text = next(iter(lookup.items()))
+                 to_tokenize.append(text)
+                 output["text_keys"].append(text_key)
+             to_tokenize = [to_tokenize]
+         else:
+             # Inputs given as (text_a, text_b) pairs: tokenize the two sides as parallel batches.
+             batch1, batch2 = [], []
+             for text_tuple in texts:
+                 batch1.append(text_tuple[0])
+                 batch2.append(text_tuple[1])
+             to_tokenize = [batch1, batch2]
+
+         max_seq_length = self.config.max_seq_length
+         output.update(
+             self.tokenizer(
+                 *to_tokenize,
+                 padding=padding,
+                 truncation="longest_first",
+                 return_tensors="pt",
+                 max_length=max_seq_length,
+             )
+         )
+         return output
+
+     def get_config_dict(self) -> dict[str, Any]:
+         return {}
+
+     def save(self, output_path: str, safe_serialization: bool = True) -> None:
+         self.auto_model.save_pretrained(output_path, safe_serialization=safe_serialization)
+         self.tokenizer.save_pretrained(output_path)
+
+         with open(os.path.join(output_path, "sentence_bert_config.json"), "w") as fOut:
+             json.dump(self.get_config_dict(), fOut, indent=2)
+
+     @classmethod
+     def load(cls, input_path: str) -> Transformer:
+         sbert_config_path = os.path.join(input_path, "sentence_bert_config.json")
+         if not os.path.exists(sbert_config_path):
+             return cls(model_name_or_path=input_path)
+
+         with open(sbert_config_path) as fIn:
+             config = json.load(fIn)
+         # Don't allow configs to set trust_remote_code
+         if "model_args" in config and "trust_remote_code" in config["model_args"]:
+             config["model_args"].pop("trust_remote_code")
+         if "tokenizer_args" in config and "trust_remote_code" in config["tokenizer_args"]:
+             config["tokenizer_args"].pop("trust_remote_code")
+         if "config_args" in config and "trust_remote_code" in config["config_args"]:
+             config["config_args"].pop("trust_remote_code")
+         return cls(model_name_or_path=input_path, **config)
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,945 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "|||IP_ADDRESS|||",
5
+ "lstrip": false,
6
+ "normalized": true,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": false
10
+ },
11
+ "1": {
12
+ "content": "<|padding|>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "50254": {
20
+ "content": " ",
21
+ "lstrip": false,
22
+ "normalized": true,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": false
26
+ },
27
+ "50255": {
28
+ "content": " ",
29
+ "lstrip": false,
30
+ "normalized": true,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": false
34
+ },
35
+ "50256": {
36
+ "content": " ",
37
+ "lstrip": false,
38
+ "normalized": true,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": false
42
+ },
43
+ "50257": {
44
+ "content": " ",
45
+ "lstrip": false,
46
+ "normalized": true,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": false
50
+ },
51
+ "50258": {
52
+ "content": " ",
53
+ "lstrip": false,
54
+ "normalized": true,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": false
58
+ },
59
+ "50259": {
60
+ "content": " ",
61
+ "lstrip": false,
62
+ "normalized": true,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": false
66
+ },
67
+ "50260": {
68
+ "content": " ",
69
+ "lstrip": false,
70
+ "normalized": true,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": false
74
+ },
75
+ "50261": {
76
+ "content": " ",
77
+ "lstrip": false,
78
+ "normalized": true,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": false
82
+ },
83
+ "50262": {
84
+ "content": " ",
85
+ "lstrip": false,
86
+ "normalized": true,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": false
90
+ },
91
+ "50263": {
92
+ "content": " ",
93
+ "lstrip": false,
94
+ "normalized": true,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": false
98
+ },
99
+ "50264": {
100
+ "content": " ",
101
+ "lstrip": false,
102
+ "normalized": true,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": false
106
+ },
107
+ "50265": {
108
+ "content": " ",
109
+ "lstrip": false,
110
+ "normalized": true,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": false
114
+ },
115
+ "50266": {
116
+ "content": " ",
117
+ "lstrip": false,
118
+ "normalized": true,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": false
122
+ },
123
+ "50267": {
124
+ "content": " ",
125
+ "lstrip": false,
126
+ "normalized": true,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": false
130
+ },
131
+ "50268": {
132
+ "content": " ",
133
+ "lstrip": false,
134
+ "normalized": true,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": false
138
+ },
139
+ "50269": {
140
+ "content": " ",
141
+ "lstrip": false,
142
+ "normalized": true,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": false
146
+ },
147
+ "50270": {
148
+ "content": " ",
149
+ "lstrip": false,
150
+ "normalized": true,
151
+ "rstrip": false,
152
+ "single_word": false,
153
+ "special": false
154
+ },
155
+ "50271": {
156
+ "content": " ",
157
+ "lstrip": false,
158
+ "normalized": true,
159
+ "rstrip": false,
160
+ "single_word": false,
161
+ "special": false
162
+ },
163
+ "50272": {
164
+ "content": " ",
165
+ "lstrip": false,
166
+ "normalized": true,
167
+ "rstrip": false,
168
+ "single_word": false,
169
+ "special": false
170
+ },
171
+ "50273": {
172
+ "content": " ",
173
+ "lstrip": false,
174
+ "normalized": true,
175
+ "rstrip": false,
176
+ "single_word": false,
177
+ "special": false
178
+ },
179
+ "50274": {
180
+ "content": " ",
181
+ "lstrip": false,
182
+ "normalized": true,
183
+ "rstrip": false,
184
+ "single_word": false,
185
+ "special": false
186
+ },
187
+ "50275": {
188
+ "content": " ",
189
+ "lstrip": false,
190
+ "normalized": true,
191
+ "rstrip": false,
192
+ "single_word": false,
193
+ "special": false
194
+ },
195
+ "50276": {
196
+ "content": " ",
197
+ "lstrip": false,
198
+ "normalized": true,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": false
202
+ },
203
+ "50277": {
204
+ "content": "|||EMAIL_ADDRESS|||",
205
+ "lstrip": false,
206
+ "normalized": true,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": false
210
+ },
211
+ "50278": {
212
+ "content": "|||PHONE_NUMBER|||",
213
+ "lstrip": false,
214
+ "normalized": true,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": false
218
+ },
219
+ "50279": {
220
+ "content": "<|endoftext|>",
221
+ "lstrip": false,
222
+ "normalized": false,
223
+ "rstrip": false,
224
+ "single_word": false,
225
+ "special": true
226
+ },
227
+ "50280": {
228
+ "content": "[UNK]",
229
+ "lstrip": false,
230
+ "normalized": false,
231
+ "rstrip": false,
232
+ "single_word": false,
233
+ "special": true
234
+ },
235
+ "50281": {
236
+ "content": "[CLS]",
237
+ "lstrip": false,
238
+ "normalized": false,
239
+ "rstrip": false,
240
+ "single_word": false,
241
+ "special": true
242
+ },
243
+ "50282": {
244
+ "content": "[SEP]",
245
+ "lstrip": false,
246
+ "normalized": false,
247
+ "rstrip": false,
248
+ "single_word": false,
249
+ "special": true
250
+ },
251
+ "50283": {
252
+ "content": "[PAD]",
253
+ "lstrip": false,
254
+ "normalized": false,
255
+ "rstrip": false,
256
+ "single_word": false,
257
+ "special": true
258
+ },
259
+ "50284": {
260
+ "content": "[MASK]",
261
+ "lstrip": true,
262
+ "normalized": false,
263
+ "rstrip": false,
264
+ "single_word": false,
265
+ "special": true
266
+ },
267
+ "50285": {
268
+ "content": "[unused0]",
269
+ "lstrip": false,
270
+ "normalized": true,
271
+ "rstrip": false,
272
+ "single_word": false,
273
+ "special": false
274
+ },
275
+ "50286": {
276
+ "content": "[unused1]",
277
+ "lstrip": false,
278
+ "normalized": true,
279
+ "rstrip": false,
280
+ "single_word": false,
281
+ "special": false
282
+ },
283
+ "50287": {
284
+ "content": "[unused2]",
285
+ "lstrip": false,
286
+ "normalized": true,
287
+ "rstrip": false,
288
+ "single_word": false,
289
+ "special": false
290
+ },
291
+ "50288": {
292
+ "content": "[unused3]",
293
+ "lstrip": false,
294
+ "normalized": true,
295
+ "rstrip": false,
296
+ "single_word": false,
297
+ "special": false
298
+ },
299
+ "50289": {
300
+ "content": "[unused4]",
301
+ "lstrip": false,
302
+ "normalized": true,
303
+ "rstrip": false,
304
+ "single_word": false,
305
+ "special": false
306
+ },
307
+ "50290": {
308
+ "content": "[unused5]",
309
+ "lstrip": false,
310
+ "normalized": true,
311
+ "rstrip": false,
312
+ "single_word": false,
313
+ "special": false
314
+ },
315
+ "50291": {
316
+ "content": "[unused6]",
317
+ "lstrip": false,
318
+ "normalized": true,
319
+ "rstrip": false,
320
+ "single_word": false,
321
+ "special": false
322
+ },
323
+ "50292": {
324
+ "content": "[unused7]",
325
+ "lstrip": false,
326
+ "normalized": true,
327
+ "rstrip": false,
328
+ "single_word": false,
329
+ "special": false
330
+ },
331
+ "50293": {
332
+ "content": "[unused8]",
333
+ "lstrip": false,
334
+ "normalized": true,
335
+ "rstrip": false,
336
+ "single_word": false,
337
+ "special": false
338
+ },
339
+ "50294": {
340
+ "content": "[unused9]",
341
+ "lstrip": false,
342
+ "normalized": true,
343
+ "rstrip": false,
344
+ "single_word": false,
345
+ "special": false
346
+ },
347
+ "50295": {
348
+ "content": "[unused10]",
349
+ "lstrip": false,
350
+ "normalized": true,
351
+ "rstrip": false,
352
+ "single_word": false,
353
+ "special": false
354
+ },
355
+ "50296": {
356
+ "content": "[unused11]",
357
+ "lstrip": false,
358
+ "normalized": true,
359
+ "rstrip": false,
360
+ "single_word": false,
361
+ "special": false
362
+ },
363
+ "50297": {
364
+ "content": "[unused12]",
365
+ "lstrip": false,
366
+ "normalized": true,
367
+ "rstrip": false,
368
+ "single_word": false,
369
+ "special": false
370
+ },
371
+ "50298": {
372
+ "content": "[unused13]",
373
+ "lstrip": false,
374
+ "normalized": true,
375
+ "rstrip": false,
376
+ "single_word": false,
377
+ "special": false
378
+ },
379
+ "50299": {
380
+ "content": "[unused14]",
381
+ "lstrip": false,
382
+ "normalized": true,
383
+ "rstrip": false,
384
+ "single_word": false,
385
+ "special": false
386
+ },
387
+ "50300": {
388
+ "content": "[unused15]",
389
+ "lstrip": false,
390
+ "normalized": true,
391
+ "rstrip": false,
392
+ "single_word": false,
393
+ "special": false
394
+ },
395
+ "50301": {
396
+ "content": "[unused16]",
397
+ "lstrip": false,
398
+ "normalized": true,
399
+ "rstrip": false,
400
+ "single_word": false,
401
+ "special": false
402
+ },
403
+ "50302": {
404
+ "content": "[unused17]",
405
+ "lstrip": false,
406
+ "normalized": true,
407
+ "rstrip": false,
408
+ "single_word": false,
409
+ "special": false
410
+ },
411
+ "50303": {
412
+ "content": "[unused18]",
413
+ "lstrip": false,
414
+ "normalized": true,
415
+ "rstrip": false,
416
+ "single_word": false,
417
+ "special": false
418
+ },
419
+ "50304": {
420
+ "content": "[unused19]",
421
+ "lstrip": false,
422
+ "normalized": true,
423
+ "rstrip": false,
424
+ "single_word": false,
425
+ "special": false
426
+ },
427
+ "50305": {
428
+ "content": "[unused20]",
429
+ "lstrip": false,
430
+ "normalized": true,
431
+ "rstrip": false,
432
+ "single_word": false,
433
+ "special": false
434
+ },
435
+ "50306": {
436
+ "content": "[unused21]",
437
+ "lstrip": false,
438
+ "normalized": true,
439
+ "rstrip": false,
440
+ "single_word": false,
441
+ "special": false
442
+ },
443
+ "50307": {
444
+ "content": "[unused22]",
445
+ "lstrip": false,
446
+ "normalized": true,
447
+ "rstrip": false,
448
+ "single_word": false,
449
+ "special": false
450
+ },
451
+ "50308": {
452
+ "content": "[unused23]",
453
+ "lstrip": false,
454
+ "normalized": true,
455
+ "rstrip": false,
456
+ "single_word": false,
457
+ "special": false
458
+ },
459
+ "50309": {
460
+ "content": "[unused24]",
461
+ "lstrip": false,
462
+ "normalized": true,
463
+ "rstrip": false,
464
+ "single_word": false,
465
+ "special": false
466
+ },
467
+ "50310": {
468
+ "content": "[unused25]",
469
+ "lstrip": false,
470
+ "normalized": true,
471
+ "rstrip": false,
472
+ "single_word": false,
473
+ "special": false
474
+ },
475
+ "50311": {
476
+ "content": "[unused26]",
477
+ "lstrip": false,
478
+ "normalized": true,
479
+ "rstrip": false,
480
+ "single_word": false,
481
+ "special": false
482
+ },
483
+ "50312": {
484
+ "content": "[unused27]",
485
+ "lstrip": false,
486
+ "normalized": true,
487
+ "rstrip": false,
488
+ "single_word": false,
489
+ "special": false
490
+ },
491
+ "50313": {
492
+ "content": "[unused28]",
493
+ "lstrip": false,
494
+ "normalized": true,
495
+ "rstrip": false,
496
+ "single_word": false,
497
+ "special": false
498
+ },
499
+ "50314": {
500
+ "content": "[unused29]",
501
+ "lstrip": false,
502
+ "normalized": true,
503
+ "rstrip": false,
504
+ "single_word": false,
505
+ "special": false
506
+ },
507
+ "50315": {
508
+ "content": "[unused30]",
509
+ "lstrip": false,
510
+ "normalized": true,
511
+ "rstrip": false,
512
+ "single_word": false,
513
+ "special": false
514
+ },
515
+ "50316": {
516
+ "content": "[unused31]",
517
+ "lstrip": false,
518
+ "normalized": true,
519
+ "rstrip": false,
520
+ "single_word": false,
521
+ "special": false
522
+ },
523
+ "50317": {
524
+ "content": "[unused32]",
525
+ "lstrip": false,
526
+ "normalized": true,
527
+ "rstrip": false,
528
+ "single_word": false,
529
+ "special": false
530
+ },
531
+ "50318": {
532
+ "content": "[unused33]",
533
+ "lstrip": false,
534
+ "normalized": true,
535
+ "rstrip": false,
536
+ "single_word": false,
537
+ "special": false
538
+ },
539
+ "50319": {
540
+ "content": "[unused34]",
541
+ "lstrip": false,
542
+ "normalized": true,
543
+ "rstrip": false,
544
+ "single_word": false,
545
+ "special": false
546
+ },
547
+ "50320": {
548
+ "content": "[unused35]",
549
+ "lstrip": false,
550
+ "normalized": true,
551
+ "rstrip": false,
552
+ "single_word": false,
553
+ "special": false
554
+ },
555
+ "50321": {
556
+ "content": "[unused36]",
557
+ "lstrip": false,
558
+ "normalized": true,
559
+ "rstrip": false,
560
+ "single_word": false,
561
+ "special": false
562
+ },
563
+ "50322": {
564
+ "content": "[unused37]",
565
+ "lstrip": false,
566
+ "normalized": true,
567
+ "rstrip": false,
568
+ "single_word": false,
569
+ "special": false
570
+ },
571
+ "50323": {
572
+ "content": "[unused38]",
573
+ "lstrip": false,
574
+ "normalized": true,
575
+ "rstrip": false,
576
+ "single_word": false,
577
+ "special": false
578
+ },
579
+ "50324": {
580
+ "content": "[unused39]",
581
+ "lstrip": false,
582
+ "normalized": true,
583
+ "rstrip": false,
584
+ "single_word": false,
585
+ "special": false
586
+ },
587
+ "50325": {
588
+ "content": "[unused40]",
589
+ "lstrip": false,
590
+ "normalized": true,
591
+ "rstrip": false,
592
+ "single_word": false,
593
+ "special": false
594
+ },
595
+ "50326": {
596
+ "content": "[unused41]",
597
+ "lstrip": false,
598
+ "normalized": true,
599
+ "rstrip": false,
600
+ "single_word": false,
601
+ "special": false
602
+ },
603
+ "50327": {
604
+ "content": "[unused42]",
605
+ "lstrip": false,
606
+ "normalized": true,
607
+ "rstrip": false,
608
+ "single_word": false,
609
+ "special": false
610
+ },
611
+ "50328": {
612
+ "content": "[unused43]",
613
+ "lstrip": false,
614
+ "normalized": true,
615
+ "rstrip": false,
616
+ "single_word": false,
617
+ "special": false
618
+ },
619
+ "50329": {
620
+ "content": "[unused44]",
621
+ "lstrip": false,
622
+ "normalized": true,
623
+ "rstrip": false,
624
+ "single_word": false,
625
+ "special": false
626
+ },
627
+ "50330": {
628
+ "content": "[unused45]",
629
+ "lstrip": false,
630
+ "normalized": true,
631
+ "rstrip": false,
632
+ "single_word": false,
633
+ "special": false
634
+ },
635
+ "50331": {
636
+ "content": "[unused46]",
637
+ "lstrip": false,
638
+ "normalized": true,
639
+ "rstrip": false,
640
+ "single_word": false,
641
+ "special": false
642
+ },
643
+ "50332": {
644
+ "content": "[unused47]",
645
+ "lstrip": false,
646
+ "normalized": true,
647
+ "rstrip": false,
648
+ "single_word": false,
649
+ "special": false
650
+ },
651
+ "50333": {
652
+ "content": "[unused48]",
653
+ "lstrip": false,
654
+ "normalized": true,
655
+ "rstrip": false,
656
+ "single_word": false,
657
+ "special": false
658
+ },
659
+ "50334": {
660
+ "content": "[unused49]",
661
+ "lstrip": false,
662
+ "normalized": true,
663
+ "rstrip": false,
664
+ "single_word": false,
665
+ "special": false
666
+ },
667
+ "50335": {
668
+ "content": "[unused50]",
669
+ "lstrip": false,
670
+ "normalized": true,
671
+ "rstrip": false,
672
+ "single_word": false,
673
+ "special": false
674
+ },
675
+ "50336": {
676
+ "content": "[unused51]",
677
+ "lstrip": false,
678
+ "normalized": true,
679
+ "rstrip": false,
680
+ "single_word": false,
681
+ "special": false
682
+ },
683
+ "50337": {
684
+ "content": "[unused52]",
685
+ "lstrip": false,
686
+ "normalized": true,
687
+ "rstrip": false,
688
+ "single_word": false,
689
+ "special": false
690
+ },
691
+ "50338": {
692
+ "content": "[unused53]",
693
+ "lstrip": false,
694
+ "normalized": true,
695
+ "rstrip": false,
696
+ "single_word": false,
697
+ "special": false
698
+ },
699
+ "50339": {
700
+ "content": "[unused54]",
701
+ "lstrip": false,
702
+ "normalized": true,
703
+ "rstrip": false,
704
+ "single_word": false,
705
+ "special": false
706
+ },
707
+ "50340": {
708
+ "content": "[unused55]",
709
+ "lstrip": false,
710
+ "normalized": true,
711
+ "rstrip": false,
712
+ "single_word": false,
713
+ "special": false
714
+ },
715
+ "50341": {
716
+ "content": "[unused56]",
717
+ "lstrip": false,
718
+ "normalized": true,
719
+ "rstrip": false,
720
+ "single_word": false,
721
+ "special": false
722
+ },
723
+ "50342": {
724
+ "content": "[unused57]",
725
+ "lstrip": false,
726
+ "normalized": true,
727
+ "rstrip": false,
728
+ "single_word": false,
729
+ "special": false
730
+ },
731
+ "50343": {
732
+ "content": "[unused58]",
733
+ "lstrip": false,
734
+ "normalized": true,
735
+ "rstrip": false,
736
+ "single_word": false,
737
+ "special": false
738
+ },
739
+ "50344": {
740
+ "content": "[unused59]",
741
+ "lstrip": false,
742
+ "normalized": true,
743
+ "rstrip": false,
744
+ "single_word": false,
745
+ "special": false
746
+ },
747
+ "50345": {
748
+ "content": "[unused60]",
749
+ "lstrip": false,
750
+ "normalized": true,
751
+ "rstrip": false,
752
+ "single_word": false,
753
+ "special": false
754
+ },
755
+ "50346": {
756
+ "content": "[unused61]",
757
+ "lstrip": false,
758
+ "normalized": true,
759
+ "rstrip": false,
760
+ "single_word": false,
761
+ "special": false
762
+ },
763
+ "50347": {
764
+ "content": "[unused62]",
765
+ "lstrip": false,
766
+ "normalized": true,
767
+ "rstrip": false,
768
+ "single_word": false,
769
+ "special": false
770
+ },
771
+ "50348": {
772
+ "content": "[unused63]",
773
+ "lstrip": false,
774
+ "normalized": true,
775
+ "rstrip": false,
776
+ "single_word": false,
777
+ "special": false
778
+ },
779
+ "50349": {
780
+ "content": "[unused64]",
781
+ "lstrip": false,
782
+ "normalized": true,
783
+ "rstrip": false,
784
+ "single_word": false,
785
+ "special": false
786
+ },
787
+ "50350": {
788
+ "content": "[unused65]",
789
+ "lstrip": false,
790
+ "normalized": true,
791
+ "rstrip": false,
792
+ "single_word": false,
793
+ "special": false
794
+ },
795
+ "50351": {
796
+ "content": "[unused66]",
797
+ "lstrip": false,
798
+ "normalized": true,
799
+ "rstrip": false,
800
+ "single_word": false,
801
+ "special": false
802
+ },
803
+ "50352": {
804
+ "content": "[unused67]",
805
+ "lstrip": false,
806
+ "normalized": true,
807
+ "rstrip": false,
808
+ "single_word": false,
809
+ "special": false
810
+ },
811
+ "50353": {
812
+ "content": "[unused68]",
813
+ "lstrip": false,
814
+ "normalized": true,
815
+ "rstrip": false,
816
+ "single_word": false,
817
+ "special": false
818
+ },
819
+ "50354": {
820
+ "content": "[unused69]",
821
+ "lstrip": false,
822
+ "normalized": true,
823
+ "rstrip": false,
824
+ "single_word": false,
825
+ "special": false
826
+ },
827
+ "50355": {
828
+ "content": "[unused70]",
829
+ "lstrip": false,
830
+ "normalized": true,
831
+ "rstrip": false,
832
+ "single_word": false,
833
+ "special": false
834
+ },
835
+ "50356": {
836
+ "content": "[unused71]",
837
+ "lstrip": false,
838
+ "normalized": true,
839
+ "rstrip": false,
840
+ "single_word": false,
841
+ "special": false
842
+ },
843
+ "50357": {
844
+ "content": "[unused72]",
845
+ "lstrip": false,
846
+ "normalized": true,
847
+ "rstrip": false,
848
+ "single_word": false,
849
+ "special": false
850
+ },
851
+ "50358": {
852
+ "content": "[unused73]",
853
+ "lstrip": false,
854
+ "normalized": true,
855
+ "rstrip": false,
856
+ "single_word": false,
857
+ "special": false
858
+ },
859
+ "50359": {
860
+ "content": "[unused74]",
861
+ "lstrip": false,
862
+ "normalized": true,
863
+ "rstrip": false,
864
+ "single_word": false,
865
+ "special": false
866
+ },
867
+ "50360": {
868
+ "content": "[unused75]",
869
+ "lstrip": false,
870
+ "normalized": true,
871
+ "rstrip": false,
872
+ "single_word": false,
873
+ "special": false
874
+ },
875
+ "50361": {
876
+ "content": "[unused76]",
877
+ "lstrip": false,
878
+ "normalized": true,
879
+ "rstrip": false,
880
+ "single_word": false,
881
+ "special": false
882
+ },
883
+ "50362": {
884
+ "content": "[unused77]",
885
+ "lstrip": false,
886
+ "normalized": true,
887
+ "rstrip": false,
888
+ "single_word": false,
889
+ "special": false
890
+ },
891
+ "50363": {
892
+ "content": "[unused78]",
893
+ "lstrip": false,
894
+ "normalized": true,
895
+ "rstrip": false,
896
+ "single_word": false,
897
+ "special": false
898
+ },
899
+ "50364": {
900
+ "content": "[unused79]",
901
+ "lstrip": false,
902
+ "normalized": true,
903
+ "rstrip": false,
904
+ "single_word": false,
905
+ "special": false
906
+ },
907
+ "50365": {
908
+ "content": "[unused80]",
909
+ "lstrip": false,
910
+ "normalized": true,
911
+ "rstrip": false,
912
+ "single_word": false,
913
+ "special": false
914
+ },
915
+ "50366": {
916
+ "content": "[unused81]",
917
+ "lstrip": false,
918
+ "normalized": true,
919
+ "rstrip": false,
920
+ "single_word": false,
921
+ "special": false
922
+ },
923
+ "50367": {
924
+ "content": "[unused82]",
925
+ "lstrip": false,
926
+ "normalized": true,
927
+ "rstrip": false,
928
+ "single_word": false,
929
+ "special": false
930
+ }
931
+ },
932
+ "clean_up_tokenization_spaces": true,
933
+ "cls_token": "[CLS]",
934
+ "extra_special_tokens": {},
935
+ "mask_token": "[MASK]",
936
+ "model_input_names": [
937
+ "input_ids",
938
+ "attention_mask"
939
+ ],
940
+ "model_max_length": 8192,
941
+ "pad_token": "[PAD]",
942
+ "sep_token": "[SEP]",
943
+ "tokenizer_class": "PreTrainedTokenizer",
944
+ "unk_token": "[UNK]"
945
+ }