arvindcreatrix's picture
Initial commit: fine-tuned BGE-base on custom Q&A data
5e216cd verified
metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:100231
  - loss:MultipleNegativesRankingLoss
base_model: BAAI/bge-base-en-v1.5
widget:
  - source_sentence: >-
      Represent this sentence for searching relevant passages: when did
      university of georgia start playing football
    sentences:
      - >-
        Georgia Bulldogs football The Georgia Bulldogs football program
        represents the University of Georgia in the sport of American football.
        The Bulldogs compete in the Football Bowl Subdivision (FBS) of the
        National Collegiate Athletic Association (NCAA) and the Eastern Division
        of the Southeastern Conference (SEC). They play their home games at
        historic Sanford Stadium on the university's Athens, Georgia, campus.
        Georgia's inaugural season was in 1892. UGA claims two consensus
        national championships (1942 and 1980); the AP and Coaches Polls have
        each voted the Bulldogs the national champion once (1980); Georgia has
        also been named the National Champion by at least one polling authority
        in three other seasons (1927, 1946 and 1968).[5] The Bulldogs have won
        15 conference championships, including 13 SEC championships (tied for
        2nd most all-time), and have appeared in 54 bowl games, tied for second
        most all-time. The program has also produced two Heisman Trophy winners,
        four No. 1 National Football League (NFL) draft picks, and many winners
        of other national awards. The team is known for its storied history,
        unique traditions, and rabid fan base. Georgia has won over 800 games in
        their history, placing them 11th all time in wins.[6]
      - >-
        Octavio Dotel Octavio Eduardo Dotel Diaz (born November 25, 1973) is a
        Dominican former professional baseball pitcher. Dotel played for
        thirteen major league teams, more than any other player in the history
        of Major League Baseball (MLB), setting the mark when he pitched for the
        Detroit Tigers on April 7, 2012, breaking a record previously held by
        Mike Morgan, Matt Stairs, and Ron Villone.[1] Edwin Jackson tied this
        record in 2018.[2] He was a member of the Houston Astros for 5 seasons.
      - >-
        Who Wants to Be a Millionaire (U.S. game show) Over the course of the
        programs history, 12 people have answered the final question correctly
        and walked away with the top prize. These include:
  - source_sentence: >-
      Represent this sentence for searching relevant passages: when was video
      killed the radio star released
    sentences:
      - >-
        Video Killed the Radio Star "Video Killed the Radio Star" is a song
        written by Trevor Horn, Geoff Downes and Bruce Woolley in 1978. It was
        first recorded by Bruce Woolley and The Camera Club (with Thomas Dolby
        on keyboards) for their album English Garden, and later by British group
        the Buggles, consisting of Horn and Downes. The track was recorded and
        mixed in 1979, released as their debut single on 7 September 1979 by
        Island Records, and included on their first album The Age of Plastic.
        The backing track was recorded at Virgin's Town House in West London,
        and mixing and vocal recording would later take place at Sarm East
        Studios.
      - >-
        Haley Joel Osment Haley Joel Osment (born April 10, 1988) is an American
        actor. After a series of roles in television and film during the 1990s,
        including a small part in Forrest Gump playing the title character's son
        (also named Forrest Gump), Osment rose to fame for his performance as a
        young unwilling medium in M. Night Shyamalan's thriller film The Sixth
        Sense, which earned him a nomination for the Academy Award for Best
        Supporting Actor. He subsequently appeared in leading roles in several
        high-profile Hollywood films, including Steven Spielberg's A.I.
        Artificial Intelligence and Mimi Leder's Pay It Forward.
      - >-
        How Come The song is about the relationship between the members of D12.
        Eminem makes reference to his relationship to Proof, Kon Artis talks
        about Eminem and Kim's relationship, and Proof talks about the rift
        between him and Eminem.
  - source_sentence: >-
      Represent this sentence for searching relevant passages: when do the
      badgers play their first football game
    sentences:
      - >-
        The Song of Roland Charlemagne's army is fighting the Muslims in Spain.
        They have been there for seven years, and the last city standing is
        Saragossa, held by the Muslim King Marsile. Threatened by the might of
        Charlemagne's army of Franks, Marsile seeks advice from his wise man,
        Blancandrin, who councils him to conciliate the Emperor, offering to
        surrender and giving hostages. Accordingly, Marsile sends out messengers
        to Charlemagne, promising treasure and Marsile's conversion to
        Christianity if the Franks will go back to France.
      - >-
        Tell Your Heart to Beat Again "Tell Your Heart to Beat Again" is a song
        written by Bernie Herms, Randy Phillips, and Matthew West and originally
        recorded by Contemporary Christian-Worship trio, Phillips, Craig and
        Dean for their twelfth studio album, Breathe In (2012).[1] It was later
        recorded by American singer Danny Gokey for his second studio album,
        Hope in Front of Me (2014), with production by Herms. Gokey's rendition
        was released to Christian radio January 8, 2016 through BMG-Chrysalis as
        the album's third single.[2] The song reached number two on the
        Billboard Christian Songs chart, earning Gokey his highest-charting
        single to date.[3]
      - >-
        Wisconsin Badgers football The first Badger football team took the field
        in 1889, losing the only two games it played that season. In 1890,
        Wisconsin earned its first victory with a 106–0 drubbing of Whitewater
        Normal School (now the University of Wisconsin–Whitewater), still the
        most lopsided win in school history. However, the very next week the
        Badgers suffered what remains their most lopsided defeat, a humiliating
        63–0 loss at the hands of the University of Minnesota. Since then, the
        Badgers and Gophers have met 122 times, making Wisconsin vs Minnesota
        the most-played rivalry in the Football Bowl Subdivision.[5]
  - source_sentence: >-
      Represent this sentence for searching relevant passages: percentage of
      babies born at 24 weeks that survive
    sentences:
      - >-
        Industrial Revolution The precise start and end of the Industrial
        Revolution is still debated among historians, as is the pace of economic
        and social changes.[10][11][12][13] Eric Hobsbawm held that the
        Industrial Revolution began in Britain in the 1780s and was not fully
        felt until the 1830s or 1840s,[10] while T. S. Ashton held that it
        occurred roughly between 1760 and 1830.[11] Rapid industrialization
        first began in Britain, starting with mechanized spinning in the
        1780s,[14] with high rates of growth in steam power and iron production
        occurring after 1800. Mechanized textile production spread from Great
        Britain to continental Europe and the United States in the early 19th
        century, with important centres of textiles, iron and coal emerging in
        Belgium and the United States and later textiles in France.[1]
      - >-
        The Shootist The film's expansive outdoor scenes were filmed on location
        in Carson City. Bond Rogers' boarding house is the 1914 Krebs-Peterson
        House, located in Carson City’s historic residential district. The buggy
        ride was shot at Washoe Lake State Park, in the Washoe Valley, between
        Reno and Carson City. Though it was a Paramount production, the street
        scenes and most interior shots were filmed at the Warner Bros. backlot
        and sound stages in Burbank, California.[8] The horse-drawn trolley was
        an authentic one, once used as a shuttle between El Paso and Juarez,
        Mexico.[9]
      - "Fetal viability There is no sharp limit of development, gestational age, or weight at which a human fetus automatically becomes viable.[1] According to studies between 2003 and 2005, 20 to 35 percent of babies born at 23 weeks of gestation survive, while 50 to 70 percent of babies born at 24 to 25 weeks, and more than 90 percent born at 26 to 27 weeks, survive.[4] It is rare for a baby weighing less than 500\_g (17.6\_ounces) to survive.[1] A baby's chances for survival increase 3-4% per day between 23 and 24 weeks of gestation and about 2-3% per day between 24 and 26 weeks of gestation. After 26 weeks the rate of survival increases at a much slower rate because survival is high already.[5][6][7][8]"
  - source_sentence: >-
      Represent this sentence for searching relevant passages: when did bear
      inthe big blue house come out
    sentences:
      - >-
        Court of Appeals of the Philippines The Court of Appeals of the
        Philippines (Filipino: Hukuman ng Apelasyon ng Pilipinas) is the
        Philippines' second-highest judicial court, just after the Supreme
        Court. The court consists of 69 Associate Justices and 1 Presiding
        Justice. Under the Constitution, the Court of Appeals (CA) "reviews not
        only the decisions and orders of the Regional Trial Courts nationwide
        but also those of the Court of Tax Appeals, as well as the awards,
        judgments, final orders or resolutions of, or authorized by 21
        Quasi-Judicial Agencies exercising quasi-judicial functions mentioned in
        Rule 43 of the 1997 Rules of Civil Procedure, plus the National Amnesty
        Commission (Pres. Proclamation No. 347 of 1994) and Office of the
        Ombudsman (Fabian v. Desierto, 295 SCRA 470). Under RA 9282, which
        elevated the CTA to the same level of the CA, CTA en banc decisions are
        now subject to review by the Supreme Court instead of the CA (as opposed
        to what is currently provided in Section 1, Rule 43 of the Rules of
        Court). Added to the formidable list are the decisions and resolutions
        of the National Labor Relations Commission (NLRC) which are now
        initially reviewable by this court, instead of a direct recourse to the
        Supreme Court, via petition for certiorari under Rule 65 (St. Martin
        Funeral Homes v. NLRC, 295 SCRA 414)".
      - >-
        Diatom Diatoms are a widespread group and can be found in the oceans, in
        fresh water, in soils, and on damp surfaces. They are one of the
        dominant components of phytoplankton in nutrient-rich coastal waters and
        during oceanic spring blooms, since they can divide more rapidly than
        other groups of phytoplankton.[66] Most live pelagically in open water,
        although some live as surface films at the water-sediment interface
        (benthic), or even under damp atmospheric conditions. They are
        especially important in oceans, where they contribute an estimated 45%
        of the total oceanic primary production of organic material.[67] Spatial
        distribution of marine phytoplankton species is restricted both
        horizontally and vertically.[68][20]
      - >-
        Bear in the Big Blue House Bear in the Big Blue House is an American
        children's television series created by Mitchell Kriegman and produced
        by Jim Henson Television for Disney Channel's Playhouse Disney preschool
        television block. Debuting on October 20, 1997,[1][2] it aired its last
        episode on April 28, 2006.
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on BAAI/bge-base-en-v1.5

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("arvindcreatrix/bge-baes-my-qna-model")
# Run inference
sentences = [
    'Represent this sentence for searching relevant passages: when did bear inthe big blue house come out',
    "Bear in the Big Blue House Bear in the Big Blue House is an American children's television series created by Mitchell Kriegman and produced by Jim Henson Television for Disney Channel's Playhouse Disney preschool television block. Debuting on October 20, 1997,[1][2] it aired its last episode on April 28, 2006.",
    'Court of Appeals of the Philippines The Court of Appeals of the Philippines (Filipino: Hukuman ng Apelasyon ng Pilipinas) is the Philippines\' second-highest judicial court, just after the Supreme Court. The court consists of 69 Associate Justices and 1 Presiding Justice. Under the Constitution, the Court of Appeals (CA) "reviews not only the decisions and orders of the Regional Trial Courts nationwide but also those of the Court of Tax Appeals, as well as the awards, judgments, final orders or resolutions of, or authorized by 21 Quasi-Judicial Agencies exercising quasi-judicial functions mentioned in Rule 43 of the 1997 Rules of Civil Procedure, plus the National Amnesty Commission (Pres. Proclamation No. 347 of 1994) and Office of the Ombudsman (Fabian v. Desierto, 295 SCRA 470). Under RA 9282, which elevated the CTA to the same level of the CA, CTA en banc decisions are now subject to review by the Supreme Court instead of the CA (as opposed to what is currently provided in Section 1, Rule 43 of the Rules of Court). Added to the formidable list are the decisions and resolutions of the National Labor Relations Commission (NLRC) which are now initially reviewable by this court, instead of a direct recourse to the Supreme Court, via petition for certiorari under Rule 65 (St. Martin Funeral Homes v. NLRC, 295 SCRA 414)".',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Training Details

Training Dataset

Unnamed Dataset

  • Size: 100,231 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    sentence_0 sentence_1
    type string string
    details
    • min: 17 tokens
    • mean: 19.79 tokens
    • max: 31 tokens
    • min: 18 tokens
    • mean: 137.51 tokens
    • max: 512 tokens
  • Samples:
    sentence_0 sentence_1
    Represent this sentence for searching relevant passages: in baseball statistics what does opie s stand for On-base plus slugging On-base plus slugging (OPS) is a sabermetric baseball statistic calculated as the sum of a player's on-base percentage and slugging average.[1] The ability of a player both to get on base and to hit for power, two important offensive skills, are represented. An OPS of .900 or higher in Major League Baseball puts the player in the upper echelon of hitters. Typically, the league leader in OPS will score near, and sometimes above, the 1.000 mark.
    Represent this sentence for searching relevant passages: what was the first year of the nissan gtr Nissan GT-R The Nissan GT-R is a 2-door 2+2 high performance vehicle produced by Nissan, unveiled in 2007.[2][3][4] It is the successor to the Nissan Skyline GT-R, although no longer part of the Skyline range itself, the name having been given over to the R35 Series and having since left its racing roots.
    Represent this sentence for searching relevant passages: how long do former presidents receive secret service protection Former Presidents Act Former presidents were entitled from 1965 to 1996 to lifetime Secret Service protection, for themselves, spouses, and children under 16. A 1994 statute, (Pub.L. 103–329), limited post-presidential protection to ten years for presidents inaugurated after January 1, 1997.[7] Under this statute, Bill Clinton would still be entitled to lifetime protection, and all subsequent presidents would have been entitled to ten years' protection.[8] On January 10, 2013, President Barack Obama signed the Former Presidents Protection Act of 2012, reinstating lifetime Secret Service protection for his predecessor George W. Bush, himself, and all subsequent presidents.[9]
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 1
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step Training Loss
0.1596 500 0.0357
0.3192 1000 0.0175
0.4788 1500 0.0144
0.6384 2000 0.0142
0.7980 2500 0.0144
0.9575 3000 0.0133

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 4.1.0
  • Transformers: 4.54.0
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.9.0
  • Datasets: 4.0.0
  • Tokenizers: 0.21.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}