SentenceTransformer based on jxm/cde-small-v2

This is a sentence-transformers model fine-tuned from jxm/cde-small-v2. It maps sentences and paragraphs to a dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: jxm/cde-small-v2
Maximum Sequence Length: 512 tokens
Output Dimensionality: not specified
Similarity Function: Cosine Similarity
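Cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal plain-Python sketch follows; it is not the library's implementation, the function name `cosine_similarity` is illustrative, and the toy vectors stand in for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only; real embeddings come from model.encode().
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Because the score depends only on direction, scaling either vector leaves it unchanged.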

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({}) with Transformer model: ContextualDocumentEmbeddingTransformer 
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("BlackBeenie/cde-small-v2-biencoder-msmarco")
# Run inference
sentences = [
    'when did jeepers creepers come out',
    'Jeepers Creepers Wiki. Creeper. Creeper is a fictional character and the main antagonist in the 2001 horror film Jeepers Creepers and its 2003 sequel Jeepers Creepers II. It is an ancient, mysterious demon who viciously feeds on the flesh and bones of many human beings for 23 days every 23rd spring.',
    ' Creep  is a song by the English alternative rock band Radiohead. Radiohead released Creep as their debut single in 1992, and it later appeared on their first album, Pablo Honey (1993). During its initial release, Creep was not a chart success.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, embedding_dim), where embedding_dim is the model's output size

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
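To turn similarity scores into a ranking, one option is to sort the passages by their score against the query. The sketch below uses made-up scores rather than real output from this model, and `rank_passages` is a hypothetical helper, not part of Sentence Transformers.

```python
def rank_passages(query_scores):
    """Return passage indices sorted from most to least similar."""
    return sorted(range(len(query_scores)), key=lambda i: query_scores[i], reverse=True)

# Hypothetical similarities between one query and two passages
# (placeholder numbers, not produced by this model).
query_scores = [0.62, 0.18]
for rank, idx in enumerate(rank_passages(query_scores), start=1):
    print(rank, idx, query_scores[idx])
```

In the usage example above, the first sentence is the query, so the query-to-passage scores should be available as `similarities[0, 1:].tolist()`.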

Training Details

Training Dataset

Unnamed Dataset

Size: 499,184 training samples
Columns: sentence_0, sentence_1, and sentence_2

Approximate statistics based on the first 1000 samples:

	sentence_0	sentence_1	sentence_2
type	string	string	string
details	min: 4 tokens, mean: 9.26 tokens, max: 29 tokens	min: 14 tokens, mean: 81.55 tokens, max: 203 tokens	min: 16 tokens, mean: 80.95 tokens, max: 231 tokens

Samples:

sentence_0	sentence_1	sentence_2
`what year did the sandy hook incident happen`	`For Newtown, 2012 Sandy Hook Elementary School shooting is still painful. It's been three years since the terrible day Jimmy Greene's 6-year-old daughter, Ana Grace Marquez, and 19 other children were murdered in the mass shooting at Sandy Hook Elementary School. But life without Ana, who loved to sing and dance from room to room, continues to be so hard that, in some ways, Dec. 14 is no tougher than any other day for Greene.`	`Hook is a 1991 Steven Spielberg film starring Dustin Hoffman and Robin Williams. The film's storyline is based on the books written by Sir James Matthew Barrie in 1904 or 1905 and is the sequel to the first book.`
`what kind of degree do you need to be a medical assistant?`	`If you choose this path, here is what you need to do: 1 Have a high school diploma or GED. The minimum educational requirement for medical assistants is a high school diploma or equivalency degree. 2 Find a doctor who will provide training.`	`Many colleges offer two-year associate's degrees or one-year certificate programs in different areas of medical office technology. Certificate areas include billing specialist, medical administrative assistant, and medical transcriptionist. Because of the complexity of medical jargon and operational procedures, many employers prefer these professionals to hold related two-year degrees or complete one-year training programs.`
`what does usb cord do`	`The Flash Player is required to see this video. The term USB stands for Universal Serial Bus. USB cable assemblies are some of the most popular cable types available, used mostly to connect computers to peripheral devices such as cameras, camcorders, printers, scanners, and more. Devices manufactured to the current USB Revision 3.0 specification are backward compatible with version 1.1.`	`The USB 2.0 specification for a Full-Speed/High-Speed cable calls for four wires, two for data and two for power, and a braided outer shield. The USB 3.0 specification calls for a total of 10 wires plus a braided outer shield. Two wires are used for power.`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
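With these parameters, cosine similarities are multiplied by 20 and passed through a softmax cross-entropy in which each query must pick its own positive passage out of every positive and hard negative in the batch. The following is a minimal plain-Python sketch of that idea (after Henderson et al.), not the library's implementation; the function names and toy vectors are hypothetical.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, negatives=(), scale=20.0):
    """Mean cross-entropy where anchor i should select positive i among all candidates."""
    # Candidates are every positive in the batch followed by every hard negative.
    candidates = list(positives) + list(negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over the candidate scores.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)

# Toy 2-D "embeddings" (illustrative only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]
print(mnr_loss(anchors, positives, negatives))
```

The loss is near zero when each anchor is closest to its own positive and grows when another candidate in the batch scores higher, which is why larger batches supply more negatives at no extra labeling cost.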

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 32
per_device_eval_batch_size: 32
fp16: True
multi_dataset_batch_sampler: round_robin

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 32
per_device_eval_batch_size: 32
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 3
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
tp_size: 0
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin
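The combination of `learning_rate: 5e-05`, `lr_scheduler_type: linear`, and `warmup_steps: 0` implies a learning rate that decays linearly from its initial value to zero over training. The sketch below illustrates that shape; it is not the Trainer's code, `linear_lr` is a hypothetical helper, and the total of 46,800 steps is an assumption (3 epochs of 15,600 steps, i.e. 499,184 samples at batch size 32 on a single device).

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr (unused here because warmup_steps is 0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL_STEPS = 46_800  # assumed: 3 epochs x 15,600 steps per epoch
for step in (0, 23_400, 46_800):
    print(step, linear_lr(step, TOTAL_STEPS))
```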

Training Logs

Epoch	Step	Training Loss
0.0321	500	0.9856
0.0641	1000	0.4499
0.0962	1500	0.3673
0.1282	2000	0.339
0.1603	2500	0.3118
0.1923	3000	0.2929
0.2244	3500	0.2886
0.2564	4000	0.2771
0.2885	4500	0.2762
0.3205	5000	0.2716
0.3526	5500	0.2585
0.3846	6000	0.2631
0.4167	6500	0.2458
0.4487	7000	0.2496
0.4808	7500	0.252
0.5128	8000	0.2399
0.5449	8500	0.2422
0.5769	9000	0.2461
0.6090	9500	0.2314
0.6410	10000	0.2331
0.6731	10500	0.2314
0.7051	11000	0.2302
0.7372	11500	0.235
0.7692	12000	0.2176
0.8013	12500	0.2201
0.8333	13000	0.2206
0.8654	13500	0.222
0.8974	14000	0.2136
0.9295	14500	0.2108
0.9615	15000	0.2102
0.9936	15500	0.2098
1.0256	16000	0.1209
1.0577	16500	0.099
1.0897	17000	0.0944
1.1218	17500	0.0955
1.1538	18000	0.0947
1.1859	18500	0.0953
1.2179	19000	0.0943
1.25	19500	0.0911
1.2821	20000	0.0964
1.3141	20500	0.0933
1.3462	21000	0.0956
1.3782	21500	0.0941
1.4103	22000	0.0903
1.4423	22500	0.0889
1.4744	23000	0.0919
1.5064	23500	0.0917
1.5385	24000	0.0956
1.5705	24500	0.0903
1.6026	25000	0.0931
1.6346	25500	0.0931
1.6667	26000	0.089
1.6987	26500	0.0892
1.7308	27000	0.091
1.7628	27500	0.0892
1.7949	28000	0.0884
1.8269	28500	0.0889
1.8590	29000	0.0877
1.8910	29500	0.0866
1.9231	30000	0.0853
1.9551	30500	0.085
1.9872	31000	0.0867
2.0192	31500	0.055
2.0513	32000	0.0338
2.0833	32500	0.033
2.1154	33000	0.033
2.1474	33500	0.0317
2.1795	34000	0.0323
2.2115	34500	0.0322
2.2436	35000	0.0316
2.2756	35500	0.0314
2.3077	36000	0.0312
2.3397	36500	0.0324
2.3718	37000	0.0324
2.4038	37500	0.0328
2.4359	38000	0.0311
2.4679	38500	0.0312
2.5	39000	0.0312
2.5321	39500	0.0311
2.5641	40000	0.0315
2.5962	40500	0.0308
2.6282	41000	0.0308
2.6603	41500	0.0306
2.6923	42000	0.0313
2.7244	42500	0.0322
2.7564	43000	0.0315
2.7885	43500	0.0311
2.8205	44000	0.0321
2.8526	44500	0.0318
2.8846	45000	0.0305
2.9167	45500	0.031
2.9487	46000	0.032
2.9808	46500	0.0306
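The Epoch column in the log can be reproduced from the dataset size and batch size. The sketch below assumes a single device, which the card does not state; `gradient_accumulation_steps: 1` and `dataloader_drop_last: False` are taken from the hyperparameters above, and `epoch_at` is a hypothetical helper.

```python
import math

num_samples = 499_184   # training samples reported in the card
batch_size = 32         # per_device_train_batch_size, assuming one device

# The last partial batch is kept (dataloader_drop_last: False), hence ceil.
steps_per_epoch = math.ceil(num_samples / batch_size)

def epoch_at(step):
    """Fractional epoch reached after a given number of optimizer steps."""
    return round(step / steps_per_epoch, 4)

print(steps_per_epoch)   # 15600
print(epoch_at(500))     # 0.0321
print(epoch_at(46_500))  # 2.9808
```

Under the same assumption, three epochs correspond to 46,800 steps, so the final logged row (step 46,500) falls just before the end of training.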

Framework Versions

Python: 3.11.12
Sentence Transformers: 3.4.1
Transformers: 4.50.3
PyTorch: 2.6.0+cu124
Accelerate: 1.5.2
Datasets: 3.5.0
Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
