BGE-M3 Türkçe Triplet Matryoshka

This is a sentence-transformers model fine-tuned from BAAI/bge-m3 on the vodex-turkish-triplets dataset. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. It was trained with MatryoshkaLoss, so embeddings can also be truncated to 768, 512, or 256 dimensions.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-m3
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: vodex-turkish-triplets
  • Language: tr
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
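
The three modules above amount to taking the hidden state of the first ([CLS]) token and scaling it to unit length. The following is a minimal pure-Python sketch of that pooling step, assuming made-up 4-dimensional token vectors in place of real XLM-RoBERTa outputs; the helper names cls_pool, l2_normalize, and dot are illustrative and not part of the library.

```python
import math

def cls_pool(token_embeddings):
    """Return the first token's vector (CLS pooling)."""
    return list(token_embeddings[0])

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy token-level outputs: 3 tokens x 4 dims (hypothetical values).
# The real model produces up to 8192 tokens x 1024 dims.
tokens_a = [[3.0, 4.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0], [1.0, 0.0, 1.0, 0.0]]
tokens_b = [[4.0, 3.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0], [2.0, 2.0, 2.0, 2.0]]

emb_a = l2_normalize(cls_pool(tokens_a))
emb_b = l2_normalize(cls_pool(tokens_b))

# Both embeddings have unit length, so their dot product is the cosine similarity.
cosine = dot(emb_a, emb_b)
print(emb_a, emb_b, cosine)
```

Because the final Normalize() module already produces unit-length vectors, cosine similarity and dot product give the same ranking for full 1024-dimensional embeddings.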

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("seroe/bge-m3-turkish-triplet-matryoshka")
# Run inference
sentences = [
    "Vodafone Net'in internet hız garantisi var mı?",
    'Vodafone Net, internet hızını garanti etmemekte, bu hız abonenin hattının uygunluğuna ve santrale olan mesafeye bağlı olarak değişiklik göstermektedir.',
    'Vodafone Net, tüm abonelerine en az 100 Mbps hız garantisi vermektedir.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
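
Since the model was trained at 1024, 768, 512, and 256 dimensions, an embedding can be shortened by keeping its leading components and re-normalizing. The sketch below shows only that arithmetic in pure Python, with random vectors standing in for the output of model.encode; truncate_and_normalize and cosine are hypothetical helpers. In the library itself, passing truncate_dim when constructing SentenceTransformer is the usual route.

```python
import math
import random

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Random stand-ins for three 1024-dimensional embeddings (not real model outputs).
rng = random.Random(0)
full = [[rng.gauss(0.0, 1.0) for _ in range(1024)] for _ in range(3)]

for dim in (1024, 768, 512, 256):  # the dimensions used during training
    small = [truncate_and_normalize(e, dim) for e in full]
    # Pairwise similarity matrix at this dimensionality (3 x 3).
    sims = [[cosine(a, b) for b in small] for a in small]
    print(dim, len(small[0]), len(sims), len(sims[0]))
```

Smaller dimensions reduce storage and speed up search; the evaluation results in this card suggest the accuracy cost on the development set is small.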

Evaluation

Metrics

Triplet

  • Datasets: tr-triplet-dev-1024d and all-nli-test-1024d
  • Evaluated with TripletEvaluator with these parameters:
    {
        "truncate_dim": 1024
    }
    
Metric tr-triplet-dev-1024d all-nli-test-1024d
cosine_accuracy 0.6087 0.9508

Triplet

  • Datasets: tr-triplet-dev-768d and all-nli-test-768d
  • Evaluated with TripletEvaluator with these parameters:
    {
        "truncate_dim": 768
    }
    
Metric tr-triplet-dev-768d all-nli-test-768d
cosine_accuracy 0.6174 0.9533

Triplet

  • Datasets: tr-triplet-dev-512d and all-nli-test-512d
  • Evaluated with TripletEvaluator with these parameters:
    {
        "truncate_dim": 512
    }
    
Metric tr-triplet-dev-512d all-nli-test-512d
cosine_accuracy 0.6303 0.9546

Triplet

  • Datasets: tr-triplet-dev-256d and all-nli-test-256d
  • Evaluated with TripletEvaluator with these parameters:
    {
        "truncate_dim": 256
    }
    
Metric tr-triplet-dev-256d all-nli-test-256d
cosine_accuracy 0.6016 0.9546

Triplet

  • Dataset: tr-triplet-dev-1024d
  • Evaluated with TripletEvaluator with these parameters:
    {
        "truncate_dim": 1024
    }
    
Metric Value
cosine_accuracy 0.9566

Triplet

  • Dataset: tr-triplet-dev-768d
  • Evaluated with TripletEvaluator with these parameters:
    {
        "truncate_dim": 768
    }
    
Metric Value
cosine_accuracy 0.9571

Triplet

  • Dataset: tr-triplet-dev-512d
  • Evaluated with TripletEvaluator with these parameters:
    {
        "truncate_dim": 512
    }
    
Metric Value
cosine_accuracy 0.9589

Triplet

  • Dataset: tr-triplet-dev-256d
  • Evaluated with TripletEvaluator with these parameters:
    {
        "truncate_dim": 256
    }
    
Metric Value
cosine_accuracy 0.9604
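
The cosine_accuracy values above can be read as the share of triplets in which the query is closer, by cosine similarity, to the positive than to the negative, computed on the first truncate_dim components. The sketch below is one way to compute that figure, assuming toy 4-dimensional embeddings; triplet_cosine_accuracy is a hypothetical helper and not the TripletEvaluator implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def triplet_cosine_accuracy(triplets, truncate_dim=None):
    """Fraction of (query, positive, negative) embedding triplets ranked correctly."""
    correct = 0
    for q, p, n in triplets:
        if truncate_dim is not None:
            # Keep only the leading components, as a Matryoshka evaluation would.
            q, p, n = q[:truncate_dim], p[:truncate_dim], n[:truncate_dim]
        if cosine(q, p) > cosine(q, n):
            correct += 1
    return correct / len(triplets)

# Toy embeddings (hypothetical values, not model outputs).
triplets = [
    ([1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]),
    ([0.0, 1.0, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]),
    ([1.0, 1.0, 0.0, 1.0], [0.2, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]),
    ([0.1, 0.0, 1.0, 0.0], [0.0, 0.1, 0.9, 0.0], [0.0, 1.0, 0.0, 0.0]),
]

print(triplet_cosine_accuracy(triplets))                  # full vectors
print(triplet_cosine_accuracy(triplets, truncate_dim=2))  # first two components only
```

In this toy data the last triplet is only separable through its third component, so truncating to two dimensions lowers the score; the model's training objective is meant to keep such losses small at the trained dimensions.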

Training Details

Training Dataset

vodex-turkish-triplets

  • Dataset: vodex-turkish-triplets at 0c9fab0
  • Size: 70,941 training samples
  • Columns: query, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 4 tokens, mean 13.58 tokens, max 46 tokens
    • positive (string): min 11 tokens, mean 26.32 tokens, max 61 tokens
    • negative (string): min 10 tokens, mean 20.54 tokens, max 45 tokens
  • Samples:
    • query: Kampanya tarihleri ve katılım şartları
      positive: Kampanya, 11 Ekim 2018'de başlayıp 29 Ekim 2018'de sona erecek. Katılımcıların belirli bilgileri doldurması ve Vodafone Müzik pass veya Video pass sahibi olmaları gerekiyor.
      negative: Kampanya, sadece İstanbul'daki kullanıcılar için geçerli olup, diğer şehirlerden katılım mümkün değildir.
    • query: Taahhüt süresi dolmadan başka bir kampanyaya geçiş yapılırsa ne olur?
      positive: Eğer abone taahhüt süresi dolmadan başka bir kampanyaya geçerse, bu durumda önceki kampanya süresince sağlanan indirimler ve diğer faydalar, iptal tarihinden sonraki fatura ile tahsil edilecektir.
      negative: Aboneler, taahhüt süresi dolmadan başka bir kampanyaya geçtiklerinde, yeni kampanyadan faydalanmak için ek bir ücret ödemek zorundadırlar.
    • query: FreeZone üyeliğimi nasıl sorgulayabilirim?
      positive: Üyeliğinizi sorgulamak için FREEZONESORGU yazarak 1525'e SMS gönderebilirsiniz.
      negative: Üyeliğinizi sorgulamak için Vodafone mağazasına gitmeniz gerekmektedir.
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "CachedMultipleNegativesRankingLoss",
        "matryoshka_dims": [
            1024,
            768,
            512,
            256
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
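
The configuration above applies the base ranking loss to each truncated prefix of the embeddings and sums the results with the given weights. The sketch below illustrates that idea in pure Python on toy 8-dimensional vectors, with dims [8, 6, 4, 2] standing in for [1024, 768, 512, 256]. It assumes a cosine-similarity scale of 20 and treats every positive and negative in the batch as a candidate for each query; it leaves out the gradient caching that the Cached variant adds, and all function names are illustrative.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def ranking_loss(queries, positives, negatives, scale=20.0):
    """In-batch softmax loss: query i must pick positive i among all candidates."""
    candidates = positives + negatives
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, c) for c in candidates]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(queries)

def truncate(batch, dim):
    """Keep the first `dim` components of every vector in the batch."""
    return [v[:dim] for v in batch]

def matryoshka_loss(queries, positives, negatives, dims, weights):
    """Weighted sum of the base loss over truncated prefixes of the embeddings."""
    total = 0.0
    for dim, w in zip(dims, weights):
        total += w * ranking_loss(
            truncate(queries, dim), truncate(positives, dim), truncate(negatives, dim)
        )
    return total

# Toy batch of 4 triplets (random stand-ins, not model outputs).
rng = random.Random(0)
queries = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
positives = [[x + 0.1 * rng.gauss(0.0, 1.0) for x in q] for q in queries]
negatives = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]

loss = matryoshka_loss(queries, positives, negatives, dims=[8, 6, 4, 2], weights=[1, 1, 1, 1])
print(loss)
```

With equal weights, every prefix length contributes equally to the gradient, which is what pushes the leading dimensions to carry most of the ranking signal.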
    

Evaluation Dataset

vodex-turkish-triplets

  • Dataset: vodex-turkish-triplets at 0c9fab0
  • Size: 3,941 evaluation samples
  • Columns: query, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 4 tokens, mean 13.26 tokens, max 36 tokens
    • positive (string): min 12 tokens, mean 26.55 tokens, max 62 tokens
    • negative (string): min 9 tokens, mean 20.4 tokens, max 40 tokens
  • Samples:
    • query: Vodafone Net'e geçiş yaparken bağlantı ücreti var mı?
      positive: Vodafone Net'e geçişte 264 TL bağlantı ücreti bulunmaktadır ve bu ücret 24 ay boyunca aylık 11 TL olarak faturalandırılmaktadır.
      negative: Vodafone Net'e geçişte bağlantı ücreti yoktur ve tüm işlemler ücretsizdir.
    • query: Bağımsız akıllı cihaz kampanyalarının detayları nelerdir?
      positive: Kampanyalar, farklı cihaz modelleri için aylık ödeme planları sunmaktadır.
      negative: Vodafone'un kampanyaları, sadece internet paketleri ile ilgilidir.
    • query: Fibermax hizmeti iptal edilirse ne gibi sonuçlar doğar?
      positive: İptal işlemi taahhüt süresi bitmeden yapılırsa, indirimler ve ücretsiz hizmet bedelleri ödenmelidir.
      negative: Fibermax hizmeti iptal edildiğinde, kullanıcıdan hiçbir ücret talep edilmez.
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "CachedMultipleNegativesRankingLoss",
        "matryoshka_dims": [
            1024,
            768,
            512,
            256
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 2048
  • per_device_eval_batch_size: 256
  • weight_decay: 0.01
  • num_train_epochs: 2
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.05
  • bf16: True
  • batch_sampler: no_duplicates
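
These settings, together with the 70,941 training samples and the learning_rate of 5e-05 listed under All Hyperparameters, determine the length and shape of the learning-rate schedule. The sketch below works through the arithmetic, assuming a single device, no gradient accumulation, and a linear warmup followed by a half-cosine decay to zero; lr_at is a hypothetical helper and the no_duplicates sampler could change the batch count slightly in practice.

```python
import math

train_samples = 70_941   # size of the training dataset
batch_size = 2048        # per_device_train_batch_size (one device assumed)
epochs = 2               # num_train_epochs
base_lr = 5e-5           # learning_rate
warmup_ratio = 0.05      # warmup_ratio

# dataloader_drop_last is False, so the last partial batch counts as a step.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def lr_at(step):
    """Linear warmup, then half-cosine decay to zero (assumed schedule shape)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, 2, 4, 12, 36, 60, 70):
    # Fractional epoch and learning rate at selected optimizer steps.
    print(step, round(step / steps_per_epoch, 4), lr_at(step))
```

The fractional epochs this produces for steps 12, 24, 36, 48, and 60 match the Epoch column in the training logs, which supports the 35-steps-per-epoch reading.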

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 2048
  • per_device_eval_batch_size: 256
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.05
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch   Step  Training Loss  Validation Loss  tr-triplet-dev cosine_accuracy (1024d / 768d / 512d / 256d)  all-nli-test cosine_accuracy (1024d / 768d / 512d / 256d)
-1      -1    -              -                0.6087 / 0.6174 / 0.6303 / 0.6016                            - / - / - / -
0.3429  12    10.677         3.4988           0.8764 / 0.8807 / 0.8876 / 0.8950                            - / - / - / -
0.6857  24    6.5947         2.7219           0.9345 / 0.9353 / 0.9411 / 0.9419                            - / - / - / -
1.0286  36    5.777          2.4641           0.9584 / 0.9579 / 0.9602 / 0.9617                            - / - / - / -
1.3714  48    5.3727         2.5269           0.9531 / 0.9543 / 0.9576 / 0.9546                            - / - / - / -
1.7143  60    5.1485         2.4440           0.9566 / 0.9571 / 0.9589 / 0.9604                            - / - / - / -
-1      -1    -              -                - / - / - / -                                                0.9508 / 0.9533 / 0.9546 / 0.9546

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 4.2.0.dev0
  • Transformers: 4.51.3
  • PyTorch: 2.7.0+cu126
  • Accelerate: 1.6.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}