SILMA Arabic Matryoshka Embedding Model 0.1

The SILMA Arabic Matryoshka Embedding Model 0.1 is an advanced Arabic text embedding model designed to produce powerful, contextually rich representations of text, facilitating a wide range of applications, from semantic search to document classification.

This model leverages the innovative Matryoshka Embedding technique which can be used in different dimensions to optimize the speed, storage, and accuracy trade-offs.

Usage

Direct Usage (Sentence Transformers)

First, install the Sentence Transformers library:

pip install -U sentence-transformers

Then load the model

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import pandas as pd

model_name = "silma-ai/silma-embeddding-matryoshka-0.1"
model = SentenceTransformer(model_name)

Samples

Using Matryoshka, you can specify the first (n) dimensions to represent each text.

In the following samples, you can check how each dimension affects the cosine similarity between a query and the two inputs.

You can notice the in most cases, even too low dimension (i.e. 8) can produce acceptable semantic similarity scores.

[+] Short Sentence Similarity

query = "الطقس اليوم مشمس"
sentence_1 = "الجو اليوم كان مشمسًا ورائعًا"
sentence_2 = "الطقس اليوم غائم"

scores = []
for dim in [768, 256, 48, 16, 8]:

    query_embedding = model.encode(query)[:dim]

    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()

    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |      0.479942 |      0.233572 |
# |   256 | True        |      0.509289 |      0.208452 |
# |    48 | True        |      0.598825 |      0.191677 |
# |    16 | True        |      0.917707 |      0.458854 |
# |     8 | True        |      0.948563 |      0.675662 |

[+] Long Sentence Similarity

query = "الكتاب يتحدث عن أهمية الذكاء الاصطناعي في تطوير المجتمعات الحديثة"
sentence_1 = "في هذا الكتاب، يناقش الكاتب كيف يمكن للتكنولوجيا أن تغير العالم"
sentence_2 = "الكاتب يتحدث عن أساليب الطبخ التقليدية في دول البحر الأبيض المتوسط"

scores = []
for dim in [768, 256, 48, 16, 8]:

    query_embedding = model.encode(query)[:dim]

    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()

    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |      0.637418 |      0.262693 |
# |   256 | True        |      0.614761 |      0.268267 |
# |    48 | True        |      0.758887 |      0.384649 |
# |    16 | True        |      0.885737 |      0.204213 |
# |     8 | True        |      0.918684 |      0.146478 |

[+] Question to Paragraph Matching

query = "ما هي فوائد ممارسة الرياضة؟"
sentence_1 = "ممارسة الرياضة بشكل منتظم تساعد على تحسين الصحة العامة واللياقة البدنية"
sentence_2 = "تعليم الأطفال في سن مبكرة يساعدهم على تطوير المهارات العقلية بسرعة"

scores = []
for dim in [768, 256, 48, 16, 8]:

    query_embedding = model.encode(query)[:dim]

    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()

    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |      0.520329 |    0.00295128 |
# |   256 | True        |      0.556088 |   -0.017764   |
# |    48 | True        |      0.586194 |   -0.110691   |
# |    16 | True        |      0.606462 |   -0.331682   |
# |     8 | True        |      0.689649 |   -0.359202   |

[+] Message to Intent-Name Mapping

query = "أرغب في حجز تذكرة طيران من دبي الى القاهرة يوم الثلاثاء القادم"
sentence_1 = "حجز رحلة"
sentence_2 = "إلغاء حجز"

scores = []
for dim in [768, 256, 48, 16, 8]:

    query_embedding = model.encode(query)[:dim]

    sent1_score = cos_sim(query_embedding, model.encode(sentence_1)[:dim])[0][0].tolist()
    sent2_score = cos_sim(query_embedding, model.encode(sentence_2)[:dim])[0][0].tolist()

    scores.append({
        "dim": dim,
        "valid_top": sent1_score > sent2_score,
        "sent1_score": sent1_score,
        "sent2_score": sent2_score,
    })

scores_df = pd.DataFrame(scores)
print(scores_df.to_markdown(index=False))

# |   dim | valid_top   |   sent1_score |   sent2_score |
# |------:|:------------|--------------:|--------------:|
# |   768 | True        |     0.476535  |     0.221451  |
# |   256 | True        |     0.392701  |     0.224967  |
# |    48 | True        |     0.316223  |     0.0210683 |
# |    16 | False       |    -0.0242871 |     0.0250766 |
# |     8 | True        |    -0.215241  |    -0.258904  |

Training Details

We curated a dataset silma-ai/silma-arabic-triplets-dataset-v1.0 which contains more than 2.25M records of (anchor, positive and negative) Arabic/English samples. Only the first 600 samples were taken to be the eval dataset, while the rest were used for fine-tuning.

This produced a finetuned Matryoshka model based on aubmindlab/bert-base-arabertv02 with the following hyperparameters:

per_device_train_batch_size: 250
per_device_eval_batch_size: 10
learning_rate: 1e-05
num_train_epochs: 3
bf16: True
dataloader_drop_last: True
optim: adamw_torch_fused
batch_sampler: no_duplicates

training script

Framework Versions

Python: 3.10.14
Sentence Transformers: 3.2.0
Transformers: 4.45.2
PyTorch: 2.3.1
Accelerate: 1.0.1
Datasets: 3.0.1
Tokenizers: 0.20.1

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

Citation:

BibTeX:

@misc{silma2024embedding,
  author = {Abu Bakr Soliman, Karim Ouda, SILMA AI},
  title = {SILMA Embedding Matryoshka 0.1},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1}},
}

APA:

Abu Bakr Soliman, Karim Ouda, SILMA AI. (2024). SILMA Embedding Matryoshka STS 0.1 [Model]. Hugging Face. https://huggingface.co/silma-ai/silma-embeddding-matryoshka-0.1

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Downloads last month: 477

Safetensors

Model size

135M params

Tensor type

F32

Inference Providers NEW

Sentence Similarity

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for silma-ai/silma-embedding-matryoshka-v0.1

Base model

aubmindlab/bert-base-arabertv02

Finetuned

(3994)

this model

Finetunes

1 model

Collection including silma-ai/silma-embedding-matryoshka-v0.1

Arabic Embedding Models

Collection

A collection of highly efficient and lightweight Arabic Embedding Models developed by SILMA AI, featuring innovative Matryoshka-based models. • 2 items • Updated Nov 15, 2024 • 1

Evaluation results

accuracy on MTEB MassiveIntentClassification (ar)
test set self-reported

56.446
f1 on MTEB MassiveIntentClassification (ar)
test set self-reported

53.583
f1_weighted on MTEB MassiveIntentClassification (ar)
test set self-reported

56.822
main_score on MTEB MassiveIntentClassification (ar)
test set self-reported

56.446
accuracy on MTEB MassiveIntentClassification (en)
test set self-reported

47.401
f1 on MTEB MassiveIntentClassification (en)
test set self-reported

44.729
f1_weighted on MTEB MassiveIntentClassification (en)
test set self-reported

47.835
main_score on MTEB MassiveIntentClassification (en)
test set self-reported

47.401
accuracy on MTEB MassiveIntentClassification (ar)
validation set self-reported

56.980
f1 on MTEB MassiveIntentClassification (ar)
validation set self-reported

53.809

View on Papers With Code