SentenceTransformer based on Snowflake/snowflake-arctic-embed-m-v2.0
This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m-v2.0. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-m-v2.0
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation (https://www.sbert.net)
- Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
- Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: GteModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
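Note that the pipeline ends in a Normalize() module, so every output embedding is unit-length and a plain dot product between embeddings already equals their cosine similarity. A minimal sketch verifying this; the trust_remote_code flag is an assumption, since the underlying GteModel is typically loaded from custom modeling code on the Hub:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: trust_remote_code=True is needed because the underlying
# GteModel ships custom modeling code on the Hub.
model = SentenceTransformer(
    "amentaphd/eu-regulation-embeddings-snowflake-m-v2",
    trust_remote_code=True,
)

emb = model.encode(["zero-emission vehicles", "recharging infrastructure"])
print(np.linalg.norm(emb, axis=1))  # ~[1.0, 1.0], thanks to the Normalize() module
print(emb @ emb.T)                  # dot products equal cosine similarities
```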
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
'How does the text suggest addressing the social aspects related to low- and middle-income transport users in the context of zero-emission vehicle initiatives?',
'(b)\n\nmeasures intended to accelerate the uptake of zero-emission vehicles or to provide financial support for the deployment of fully interoperable refuelling and recharging infrastructure for zero-emission vehicles, or measures to encourage a shift to public transport and improve multimodality, or to provide financial support in order to address social aspects concerning low- and middle-income transport users;\n\n(c)\n\nto finance their Social Climate Plan in accordance with Article 15 of Regulation (EU) 2023/955;\n\n(d)',
'If the planned change is implemented notwithstanding the first and second subparagraphs, or if an unplanned change has taken place pursuant to which the AIFM’s management of the AIF no longer complies with this Directive or the AIFM otherwise no longer complies with this Directive, the competent authorities of the Member State of reference of the AIFM shall take all due measures in accordance with Article 46, including, if necessary, the express prohibition of marketing of the AIF.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
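For retrieval-style use, encode queries and documents separately and rank by cosine similarity. A small sketch along those lines, using only the encode and similarity calls shown above; the corpus strings are illustrative, not taken from the training data:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("amentaphd/eu-regulation-embeddings-snowflake-m-v2")

# Illustrative mini-corpus of regulation-style passages
corpus = [
    "Member States shall deploy interoperable recharging infrastructure.",
    "The AIFM shall notify the competent authorities of any material change.",
    "Reductions in excise duty apply to motor fuels containing biodiesel.",
]
query = "Which rules cover charging points for electric vehicles?"

corpus_embeddings = model.encode(corpus)
query_embedding = model.encode([query])

# Rank the corpus by cosine similarity to the query
scores = model.similarity(query_embedding, corpus_embeddings)  # shape [1, 3]
best = int(scores.argmax())
print(scores[0, best].item(), corpus[best])
```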
Evaluation
Metrics
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.7059 |
cosine_accuracy@3 | 0.9068 |
cosine_accuracy@5 | 0.9448 |
cosine_accuracy@10 | 0.9731 |
cosine_precision@1 | 0.7059 |
cosine_precision@3 | 0.3023 |
cosine_precision@5 | 0.189 |
cosine_precision@10 | 0.0973 |
cosine_recall@1 | 0.7059 |
cosine_recall@3 | 0.9068 |
cosine_recall@5 | 0.9448 |
cosine_recall@10 | 0.9731 |
cosine_ndcg@10 | 0.8513 |
cosine_mrr@10 | 0.8109 |
cosine_map@100 | 0.8123 |
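A hedged sketch of how such an evaluation is wired up with InformationRetrievalEvaluator; the queries, corpus, and relevance judgments below are illustrative placeholders, since the actual held-out split is not published in this card:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("amentaphd/eu-regulation-embeddings-snowflake-m-v2")

# Placeholder data: ids -> texts, plus each query's set of relevant doc ids
queries = {"q1": "What is the significance of the one-month period?"}
corpus = {
    "d1": "one month after its notification, in accordance with Article 23.",
    "d2": "measures intended to accelerate the uptake of zero-emission vehicles.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="eu-regulation-dev",  # hypothetical evaluator name
)
results = evaluator(model)  # dict of accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100
print(results)
```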
Training Details
Training Dataset
Unnamed Dataset
- Size: 46,338 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 1000 samples:

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | min: 9 tokens, mean: 39.98 tokens, max: 286 tokens | min: 3 tokens, mean: 248.72 tokens, max: 1315 tokens |
- Samples:

| sentence_0 | sentence_1 |
|---|---|
| What is the maximum allowable reduction in excise duty for mixtures used as motor fuels containing biodiesel in Italy until 30 June 2004? | for waste oils which are reused as fuel, either directly after recovery or following a recycling process for waste oils, and where the reuse is subject to duty. 8. ITALY: for differentiated rates of excise duty on mixtures used as motor fuels containing 5 % or 25 % of biodiesel until 30 June 2004. The reduction in excise duty may not be greater than the amount of excise duty payable on the volume of biofuels present in the products eligible for the reduction. The reduction in excise duty shall be adjusted to take account of changes in the price of raw materials to avoid overcompensating for the extra costs involved in the manufacture of biofuels; |
| What are the minimum indicative share percentages for the years 2023 to 2030, and how do these percentages relate to the interconnectivity levels of the Member States? | Such indicative shares may, in each year, amount to at least 5 % from 2023 to 2026 and at least 10 % from 2027 to 2030, or, where lower, to the level of interconnectivity of the Member State concerned in any given year. In order to acquire further implementation experience, Member States may organise one or more pilot schemes where support is open to producers located in other Member States. 2. |
| What is the significance of the one-month period mentioned in the context? | one month after its notification, in accordance with the arrangements provided for in Article 23. |
- Loss: MatryoshkaLoss with these parameters:

  {
      "loss": "MultipleNegativesRankingLoss",
      "matryoshka_dims": [768, 512, 256, 128, 64],
      "matryoshka_weights": [1, 1, 1, 1, 1],
      "n_dims_per_step": -1
  }
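These parameters correspond to wrapping an in-batch-negatives ranking loss in MatryoshkaLoss, so the model is trained to produce useful embeddings at 768, 512, 256, 128, and 64 dimensions with equal weight on each. A sketch of the equivalent construction, and of the embedding truncation this enables at inference time (assumed, based on the parameters above):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v2.0", trust_remote_code=True)

# MultipleNegativesRankingLoss treats the other in-batch sentence_1 passages
# as negatives; MatryoshkaLoss applies it at each nested dimensionality.
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],  # weights default to 1 each
)

# At inference, Matryoshka training makes truncated embeddings usable:
small = SentenceTransformer(
    "amentaphd/eu-regulation-embeddings-snowflake-m-v2",
    truncate_dim=256,  # any of the trained dims above
)
print(small.encode(["test"]).shape)  # (1, 256)
```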
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- num_train_epochs: 4
- fp16: True
- multi_dataset_batch_sampler: round_robin
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 4
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
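Taken together, the non-default hyperparameters above map onto the Sentence Transformers v4 trainer roughly as follows. This is a minimal sketch, with a one-pair toy dataset standing in for the unnamed 46,338-pair training set and doubling as a toy eval split (eval_strategy="steps" requires one):

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import MultiDatasetBatchSamplers

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v2.0", trust_remote_code=True)
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],
)

# Toy stand-in for the (sentence_0, sentence_1) pair dataset
train_dataset = Dataset.from_dict({
    "sentence_0": ["What is the significance of the one-month period?"],
    "sentence_1": ["one month after its notification, per Article 23."],
})

args = SentenceTransformerTrainingArguments(
    output_dir="eu-regulation-embeddings",  # hypothetical path
    num_train_epochs=4,
    per_device_train_batch_size=8,
    fp16=True,
    eval_strategy="steps",
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # toy eval split, only to satisfy eval_strategy="steps"
    loss=loss,
)
trainer.train()
```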
Training Logs
Epoch | Step | Training Loss | cosine_ndcg@10 |
---|---|---|---|
0.0863 | 500 | 0.225 | - |
0.1726 | 1000 | 0.1337 | - |
0.2589 | 1500 | 0.1195 | - |
0.3452 | 2000 | 0.0803 | - |
0.4316 | 2500 | 0.0775 | - |
0.5179 | 3000 | 0.0714 | - |
0.6042 | 3500 | 0.0852 | - |
0.6905 | 4000 | 0.0718 | - |
0.7768 | 4500 | 0.0499 | - |
0.8631 | 5000 | 0.0665 | 0.8371 |
0.9494 | 5500 | 0.0674 | - |
1.0 | 5793 | - | 0.8416 |
1.0357 | 6000 | 0.0538 | - |
1.1220 | 6500 | 0.0606 | - |
1.2084 | 7000 | 0.0294 | - |
1.2947 | 7500 | 0.0129 | - |
1.3810 | 8000 | 0.0101 | - |
1.4673 | 8500 | 0.0072 | - |
1.5536 | 9000 | 0.0211 | - |
1.6399 | 9500 | 0.0133 | - |
1.7262 | 10000 | 0.0063 | 0.8513 |
Framework Versions
- Python: 3.10.15
- Sentence Transformers: 4.0.2
- Transformers: 4.49.0
- PyTorch: 2.6.0+cu126
- Accelerate: 0.26.0
- Datasets: 3.5.0
- Tokenizers: 0.21.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}