---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:4937
  - loss:AttributeTripletLoss
base_model: Alibaba-NLP/gte-base-en-v1.5
widget:
  - source_sentence: Betty J. Eadie
    sentences:
      - Danny Gregory
      - Grove Press / Atlantic Monthly Press
      - author
      - publisher
  - source_sentence: George Heussenstamm
    sentences:
      - '1994'
      - author
      - publication_date
      - Devon Monk
  - source_sentence: Jeff Lindsay
    sentences:
      - '9780061655500'
      - Robert S. Kaplan
      - author
      - isbn_13
  - source_sentence: >-
      Caesar's Legion: The Epic Saga of Julius Caesar's Elite Tenth Legion and
      the Armies of Rome
    sentences:
      - '1999'
      - publication_date
      - Let the Great World Spin
      - title
  - source_sentence: '2002'
    sentences:
      - isbn_13
      - publication_date
      - 01 May 2008
      - '9780924944031'
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy
  - silhouette_cosine
  - silhouette_euclidean
model-index:
  - name: SentenceTransformer based on Alibaba-NLP/gte-base-en-v1.5
    results:
      - task:
          type: triplet
          name: Triplet
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: cosine_accuracy
            value: 1
            name: Cosine Accuracy
          - type: cosine_accuracy
            value: 1
            name: Cosine Accuracy
      - task:
          type: silhouette
          name: Silhouette
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: silhouette_cosine
            value: 0.9085697531700134
            name: Silhouette Cosine
          - type: silhouette_euclidean
            value: 0.7462140917778015
            name: Silhouette Euclidean
          - type: silhouette_cosine
            value: 0.9105125665664673
            name: Silhouette Cosine
          - type: silhouette_euclidean
            value: 0.7465167045593262
            name: Silhouette Euclidean
---

SentenceTransformer based on Alibaba-NLP/gte-base-en-v1.5

This is a sentence-transformers model finetuned from Alibaba-NLP/gte-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Alibaba-NLP/gte-base-en-v1.5
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
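
The figures above can be confirmed on a loaded model through the public sentence-transformers API (a minimal sketch; the repository name is the one used in the usage example below, and trust_remote_code is passed because the GTE encoder ships custom model code):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "albertus-sussex/veriscrape-sbert-book-reference_2_to_verify_8-fold-3",
    trust_remote_code=True,  # custom GTE ("NewModel") architecture
)
print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 768
print(model.similarity_fn_name)                  # cosine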

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("albertus-sussex/veriscrape-sbert-book-reference_2_to_verify_8-fold-3")
# Run inference
sentences = [
    '2002',
    '01 May 2008',
    '9780924944031',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
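
The widget examples above pair a raw field value with candidate attribute names, so one natural use is scoring which book attribute a scraped value most likely belongs to. A minimal sketch, reusing the model loaded above (the candidate list is illustrative, not a label set shipped with the model):

value = "01 May 2008"
candidates = ["author", "publisher", "publication_date", "isbn_13", "title"]

value_emb = model.encode([value])
candidate_embs = model.encode(candidates)

scores = model.similarity(value_emb, candidate_embs)   # shape [1, 5]
print(candidates[scores.argmax().item()])              # likely "publication_date"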

Evaluation

Metrics

Triplet

| Metric          | Value |
|:----------------|:------|
| cosine_accuracy | 1.0   |
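
Cosine accuracy is the fraction of evaluation triplets whose anchor embedding is closer (by cosine similarity) to the positive than to the negative. It can be recomputed with the built-in TripletEvaluator, sketched here on a single triplet taken from the evaluation samples below, reusing the model loaded in the Usage section:

from sentence_transformers.evaluation import TripletEvaluator

evaluator = TripletEvaluator(
    anchors=["Terry Pratchett"],
    positives=["Ally Condie"],
    negatives=["Penguin Books Ltd"],
)
print(evaluator(model))  # e.g. {'cosine_accuracy': 1.0}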

Silhouette

  • Evaluated with veriscrape.training.SilhouetteEvaluator

| Metric               | Value  |
|:---------------------|:-------|
| silhouette_cosine    | 0.9086 |
| silhouette_euclidean | 0.7462 |

Triplet

| Metric          | Value |
|:----------------|:------|
| cosine_accuracy | 1.0   |

Silhouette

  • Evaluated with veriscrape.training.SilhouetteEvaluator

| Metric               | Value  |
|:---------------------|:-------|
| silhouette_cosine    | 0.9105 |
| silhouette_euclidean | 0.7465 |
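
The silhouette scores are produced by veriscrape.training.SilhouetteEvaluator, which is not a published package; conceptually, they measure how well the embeddings of values cluster by attribute. A rough equivalent can be computed with scikit-learn, using each value's attribute name as its cluster label (a sketch under that assumption, reusing the model loaded above):

from sklearn.metrics import silhouette_score

values = ["Terry Pratchett", "Ally Condie", "9780395851821", "9780755382170"]
labels = ["author", "author", "isbn_13", "isbn_13"]  # attribute of each value

embeddings = model.encode(values)
print(silhouette_score(embeddings, labels, metric="cosine"))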

Training Details

Training Dataset

Unnamed Dataset

  • Size: 4,937 training samples
  • Columns: anchor, positive, negative, pos_attr_name, and neg_attr_name
  • Approximate statistics based on the first 1000 samples:
    |             | anchor | positive | negative | pos_attr_name | neg_attr_name |
    |:------------|:-------|:---------|:---------|:--------------|:--------------|
    | type        | string | string   | string   | string        | string        |
    | min tokens  | 3      | 3        | 3        | 3             | 3             |
    | mean tokens | 7.14   | 6.74     | 6.34     | 3.74          | 3.79          |
    | max tokens  | 30     | 31       | 30       | 5             | 5             |
  • Samples:
    | anchor | positive | negative | pos_attr_name | neg_attr_name |
    |:-------|:---------|:---------|:--------------|:--------------|
    | Anatomy Lessons from the Great Masters: 100 Great Figure Drawings Analysed | Out of Egypt: A Memoir | Agate Publishing | title | publisher |
    | 9780439888141 | 9781573225342 | Living Dead in Dallas: A Sookie Stackhouse Novel | isbn_13 | title |
    | 06 October 2006 | 01 June 2008 | Colum McCann | publication_date | author |
  • Loss: veriscrape.training.AttributeTripletLoss with these parameters:
    {
        "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
        "triplet_margin": 5
    }
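
veriscrape.training.AttributeTripletLoss is not a published package, so only the configuration above is known from this card. A rough stand-in with the same distance metric and margin can be built from the stock sentence-transformers triplet loss (an assumption for illustration; it ignores the pos_attr_name/neg_attr_name columns the custom loss presumably uses):

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.losses import TripletDistanceMetric

# trust_remote_code is needed for the custom GTE encoder code
model = SentenceTransformer("Alibaba-NLP/gte-base-en-v1.5", trust_remote_code=True)

loss = losses.TripletLoss(
    model=model,
    distance_metric=TripletDistanceMetric.EUCLIDEAN,  # matches "TripletDistanceMetric.EUCLIDEAN"
    triplet_margin=5,                                 # matches "triplet_margin": 5
)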
    

Evaluation Dataset

Unnamed Dataset

  • Size: 549 evaluation samples
  • Columns: anchor, positive, negative, pos_attr_name, and neg_attr_name
  • Approximate statistics based on the first 549 samples:
    |             | anchor | positive | negative | pos_attr_name | neg_attr_name |
    |:------------|:-------|:---------|:---------|:--------------|:--------------|
    | type        | string | string   | string   | string        | string        |
    | min tokens  | 3      | 3        | 3        | 3             | 3             |
    | mean tokens | 7.1    | 6.57     | 6.52     | 3.78          | 3.88          |
    | max tokens  | 29     | 33       | 30       | 5             | 5             |
  • Samples:
    | anchor | positive | negative | pos_attr_name | neg_attr_name |
    |:-------|:---------|:---------|:--------------|:--------------|
    | Terry Pratchett | Ally Condie | Penguin Books Ltd | author | publisher |
    | 9780395851821 | 9780755382170 | Uncle John's Unstoppable Bathroom Reader | isbn_13 | title |
    | Virago Pr | Topaz | Don Quixote | publisher | title |
  • Loss: veriscrape.training.AttributeTripletLoss with these parameters:
    {
        "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
        "triplet_margin": 5
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • num_train_epochs: 5
  • warmup_ratio: 0.1
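
A minimal sketch of how these non-default values map onto the sentence-transformers 3.x trainer API (the output path, toy datasets, and stock TripletLoss are illustrative stand-ins, not the original training script):

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.losses import TripletDistanceMetric

model = SentenceTransformer("Alibaba-NLP/gte-base-en-v1.5", trust_remote_code=True)

# Toy (anchor, positive, negative) triplets taken from the sample rows above;
# the actual run used the 4,937-sample training set and 549-sample eval set.
train_dataset = Dataset.from_dict({
    "anchor":   ["9780439888141", "06 October 2006"],
    "positive": ["9781573225342", "01 June 2008"],
    "negative": ["Living Dead in Dallas: A Sookie Stackhouse Novel", "Colum McCann"],
})
eval_dataset = Dataset.from_dict({
    "anchor":   ["Terry Pratchett"],
    "positive": ["Ally Condie"],
    "negative": ["Penguin Books Ltd"],
})

args = SentenceTransformerTrainingArguments(
    output_dir="outputs",             # placeholder path
    eval_strategy="epoch",
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=5,
    warmup_ratio=0.1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=losses.TripletLoss(model, distance_metric=TripletDistanceMetric.EUCLIDEAN, triplet_margin=5),
)
trainer.train()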

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

| Epoch | Step | Training Loss | Validation Loss | cosine_accuracy | silhouette_cosine |
|:-----:|:----:|:-------------:|:---------------:|:---------------:|:-----------------:|
| -1    | -1   | -             | -               | 0.5392          | 0.2047            |
| 1.0   | 39   | 1.2132        | 0.0497          | 0.9982          | 0.8942            |
| 2.0   | 78   | 0.0263        | 0.0             | 1.0             | 0.9008            |
| 3.0   | 117  | 0.005         | 0.0             | 1.0             | 0.9079            |
| 4.0   | 156  | 0.0006        | 0.0014          | 1.0             | 0.9077            |
| 5.0   | 195  | 0.0005        | 0.0017          | 1.0             | 0.9086            |
| -1    | -1   | -             | -               | 1.0             | 0.9105            |

Framework Versions

  • Python: 3.10.16
  • Sentence Transformers: 3.4.1
  • Transformers: 4.45.2
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.5.2
  • Datasets: 3.1.0
  • Tokenizers: 0.20.3

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

AttributeTripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}