SentenceTransformer based on Snowflake/snowflake-arctic-embed-l

This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-l. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-l
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Model Size: 334M parameters (F32)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
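
In plain terms: the BERT encoder produces per-token embeddings, the Pooling module keeps only the [CLS] token (pooling_mode_cls_token is True), and Normalize() applies L2 normalization. Below is a minimal sketch of that computation using transformers directly; it assumes the repo's root config loads as a plain BertModel, and the sentence-transformers snippet in the Usage section remains the recommended path.

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "chelleboyer/llm-evals-2-79b954ef-4798-4994-be72-a88d46b8ecca"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

batch = tokenizer(["example sentence"], padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq, 1024)

# (1) Pooling with pooling_mode_cls_token=True keeps the [CLS] token embedding
cls_embedding = token_embeddings[:, 0]
# (2) Normalize() L2-normalizes, so cosine similarity reduces to a dot product
sentence_embedding = torch.nn.functional.normalize(cls_embedding, p=2, dim=1)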

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("chelleboyer/llm-evals-2-79b954ef-4798-4994-be72-a88d46b8ecca")
# Run inference
sentences = [
    'What is the main contribution of Kwiatkowski et al. [2019] in the field of question answering research?',
    'Kwiatkowski et\xa0al. [2019]\n\nT.\xa0Kwiatkowski, J.\xa0Palomaki, O.\xa0Redfield, M.\xa0Collins, A.\xa0Parikh, C.\xa0Alberti, D.\xa0Epstein, I.\xa0Polosukhin, M.\xa0Kelcey, J.\xa0Devlin, K.\xa0Lee, K.\xa0N. Toutanova, L.\xa0Jones, M.-W. Chang, A.\xa0Dai, J.\xa0Uszkoreit, Q.\xa0Le, and S.\xa0Petrov.\n\n\nNatural questions: a benchmark for question answering research.\n\n\nTransactions of the Association of Computational Linguistics, 2019.\n\n\n\n\nLaurer et\xa0al. [2022]\n\nM.\xa0Laurer, W.\xa0van Atteveldt, A.\xa0Casas, and K.\xa0Welbers.',
    'The sentence_support_information field is a list of objects, one for each sentence\nin the response. Each object MUST have the following fields:\n- response_sentence_key: a string identifying the sentence in the response.\nThis key is the same as the one used in the response above.\n- explanation: a string explaining why the sentence is or is not supported by the\ndocuments.\n- supporting_sentence_keys: keys (e.g. ’0a’) of sentences from the documents that',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
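
Because the embeddings are unit-normalized, cosine similarity is just a dot product, which makes the model easy to use for retrieval. Below is a small sketch of semantic search with util.semantic_search; the query and corpus strings are made-up examples:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("chelleboyer/llm-evals-2-79b954ef-4798-4994-be72-a88d46b8ecca")

corpus = [
    "RAGBench is an explainable benchmark for retrieval-augmented generation.",
    "Natural Questions is a benchmark for question answering research.",
]
query = "Which benchmark targets question answering?"

corpus_embeddings = model.encode(corpus)
query_embedding = model.encode(query)

# Returns, for each query, a ranked list of {'corpus_id': ..., 'score': ...}
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])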

Evaluation

Metrics

Information Retrieval

Metric               Value
cosine_accuracy@1    0.8571
cosine_accuracy@3    0.9643
cosine_accuracy@5    1.0
cosine_accuracy@10   1.0
cosine_precision@1   0.8571
cosine_precision@3   0.3214
cosine_precision@5   0.2
cosine_precision@10  0.1
cosine_recall@1      0.8571
cosine_recall@3      0.9643
cosine_recall@5      1.0
cosine_recall@10     1.0
cosine_ndcg@10       0.9386
cosine_mrr@10        0.9179
cosine_map@100       0.9179
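
These are standard retrieval metrics at cutoffs k = 1, 3, 5, 10, computed with cosine similarity. Numbers of this form can be produced with the sentence-transformers InformationRetrievalEvaluator; the sketch below uses placeholder data, since the actual evaluation split is not included in this card:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("chelleboyer/llm-evals-2-79b954ef-4798-4994-be72-a88d46b8ecca")

# Placeholder data: id -> text, and each query id -> ids of its relevant docs
queries = {"q1": "What does the TRACe framework evaluate?"}
corpus = {"d1": "3.2 TRACe Evaluation Framework ...", "d2": "Unrelated text."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)
results = evaluator(model)
print(results)  # includes cosine_accuracy@k, cosine_ndcg@10, cosine_mrr@10, ...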

Training Details

Training Dataset

Unnamed Dataset

  • Size: 400 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 400 samples:
    sentence_0: string, min: 3 tokens, mean: 21.42 tokens, max: 53 tokens
    sentence_1: string, min: 3 tokens, mean: 93.8 tokens, max: 200 tokens
  • Samples (the sentence_1 values are section-heading excerpts from the RAGBench paper):
    sentence_0: What are the key components and criteria used in the TRACe Evaluation Framework within RAGBench?
    sentence_1: RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems / 1 Introduction / 2 Related Work / RAG evaluation / Finetuned RAG evaluation models / 3 RAGBench Construction / 3.1 Component Datasets / Source Domains / Context Token Length / Task Types / Question Sources / Response Generation / Data Splits / 3.2 TRACe Evaluation Framework / Definitions / Context Relevance / Context Utilization / Completeness / Adherence / 3.3 RAGBench Statistics / 3.4 LLM annotator

    sentence_0: How does RAGBench utilize component datasets to construct a benchmark for Retrieval-Augmented Generation systems?
    sentence_1: (same section-heading excerpt as the first sample)

    sentence_0: What are the key components and findings discussed in the RAGBench Statistics and Case Study sections?
    sentence_1: 3.3 RAGBench Statistics / 3.4 LLM annotator / Alignment with Human Judgements / 3.5 RAG Case Study / 4 Experiments / 4.1 LLM Judge / 4.2 Fine-tuned Judge / 4.3 Evaluation / 5 Results / Estimating Context Relevance is Difficult / 6 Conclusion / 7 Appendix / 7.1 RAGBench Code and Data / 7.2 RAGBench Dataset Details / PubMedQA [14] / CovidQA-RAG / HotpotQA [42] / MS Marco [28] / CUAD [12] / DelucionQA [33] / EManual [27] / TechQA [3] / FinQA [6] / TAT-QA [47] / HAGRID [15] / ExpertQA [25]
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
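
This wraps MultipleNegativesRankingLoss so that the leading 768, 512, 256, 128, and 64 dimensions of each embedding are all trained to work on their own, with equal weight per dimensionality. A minimal sketch of constructing the same loss:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

# In-batch negatives ranking loss, applied at each truncated dimensionality
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])

At inference time, embeddings can accordingly be truncated to one of these sizes, e.g. by loading the model with SentenceTransformer(model_id, truncate_dim=256), typically at a modest quality cost in exchange for a 4x smaller index.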
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 5
  • per_device_eval_batch_size: 5
  • num_train_epochs: 10
  • multi_dataset_batch_sampler: round_robin
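
The sketch below is a hedged reconstruction of a run with these settings using the sentence-transformers Trainer API; the dataset rows, eval set, and output path are placeholders, not the actual data behind this card:

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

# Placeholder pairs; the real dataset has 400 (sentence_0, sentence_1) rows
train_dataset = Dataset.from_dict({
    "sentence_0": ["example question about RAG evaluation"],
    "sentence_1": ["example passage that answers it"],
})

loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],
)

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder path
    num_train_epochs=10,
    per_device_train_batch_size=5,
    per_device_eval_batch_size=5,
    eval_strategy="steps",
    multi_dataset_batch_sampler="round_robin",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # placeholder; the card used a held-out eval set
    loss=loss,
)
trainer.train()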

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 5
  • per_device_eval_batch_size: 5
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch   Step   Training Loss   cosine_ndcg@10
0.625   50     -               0.9517
1.0     80     -               0.9649
1.25    100    -               0.9649
1.875   150    -               0.9517
2.0     160    -               0.9517
2.5     200    -               0.9386
3.0     240    -               0.9386
3.125   250    -               0.9517
3.75    300    -               0.9386
4.0     320    -               0.9517
4.375   350    -               0.9517
5.0     400    -               0.9517
5.625   450    -               0.9517
6.0     480    -               0.9401
6.25    500    0.3877          0.9401
6.875   550    -               0.9386
7.0     560    -               0.9386
7.5     600    -               0.9401
8.0     640    -               0.9401
8.125   650    -               0.9401
8.75    700    -               0.9386
9.0     720    -               0.9386
9.375   750    -               0.9386
10.0    800    -               0.9386

A dash means no training loss was logged at that step; with the default logging interval of 500 steps, loss was recorded only at step 500.

Framework Versions

  • Python: 3.11.12
  • Sentence Transformers: 4.1.0
  • Transformers: 4.51.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.6.0
  • Datasets: 2.14.4
  • Tokenizers: 0.21.1
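
To reproduce this environment, the listed versions can be pinned at install time (a sketch; the PyTorch 2.6.0+cu124 wheel additionally depends on your CUDA setup):

pip install "sentence-transformers==4.1.0" "transformers==4.51.3" "accelerate==1.6.0" "datasets==2.14.4" "tokenizers==0.21.1"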

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}