DIMI-embedding-sts-matryoshka

State-of-the-art DIMI Sentence Embeddings for Arabic Similarity

Author: Ahmed Zaky Mouad
Email: [email protected]

This is a sentence-transformers model finetuned from AhmedZaky1/arabic-bert-nli-matryoshka. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: AhmedZaky1/arabic-bert-nli-matryoshka
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions (truncatable to smaller sizes; see the sketch after this list)
  • Similarity Function: Cosine Similarity
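
As the name suggests, this model was trained with MatryoshkaLoss (cited below), so embeddings can typically be truncated to a prefix of the full 768 dimensions with only a modest quality drop. Below is a minimal sketch using the library's truncate_dim option; the choice of 256 dimensions is illustrative, not a recommendation from the training run:

from sentence_transformers import SentenceTransformer

# Load the model so that encode() returns only the first 256 dimensions.
# 256 is an illustrative choice; MatryoshkaLoss-trained models are optimized
# so that truncated prefixes of the embedding remain useful.
model_256 = SentenceTransformer("AhmedZaky1/DIMI-embedding-sts-matryoshka", truncate_dim=256)

embeddings = model_256.encode(["الذكاء الاصطناعي يغير العالم"])
print(embeddings.shape)  # Expected: (1, 256)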

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
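
The pooling module averages the token embeddings (pooling_mode_mean_tokens is True). For reference, here is a minimal sketch of the equivalent computation with the plain transformers library; it assumes the checkpoint loads directly with AutoModel, which holds for BertModel-based sentence-transformers repositories:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AhmedZaky1/DIMI-embedding-sts-matryoshka")
bert = AutoModel.from_pretrained("AhmedZaky1/DIMI-embedding-sts-matryoshka")

def mean_pooling(last_hidden_state, attention_mask):
    # Average token embeddings over the sequence, ignoring padding positions
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

encoded = tokenizer(["الطقس اليوم مشمس وجميل"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = bert(**encoded)
embedding = mean_pooling(output.last_hidden_state, encoded["attention_mask"])
print(embedding.shape)  # Expected: torch.Size([1, 768])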

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference:

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("AhmedZaky1/DIMI-embedding-sts-matryoshka")

# Basic usage - encoding sentences
sentences = [
    'ديترويت مؤهلة للحماية من الإفلاس',
    'ديترويت مؤهلة للحماية الإفلاسية: قاضي أمريكي',
    'بورصة نيويورك ستعيد فتحها الأربعاء',
    'الطقس اليوم مشمس وجميل',
    'السماء صافية والشمس مشرقة'
]

# Generate embeddings
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")
# Output: Embeddings shape: (5, 768)

# Calculate similarity matrix
similarities = model.similarity(embeddings, embeddings)
print(f"Similarity matrix shape: {similarities.shape}")
# Output: Similarity matrix shape: (5, 5)

# Print similarity scores
for i, sentence1 in enumerate(sentences):
    for j, sentence2 in enumerate(sentences):
        if i < j:  # Only print upper triangle
            similarity = similarities[i][j].item()
            print(f"Similarity between '{sentence1}' and '{sentence2}': {similarity:.4f}")

Semantic Search Example

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("AhmedZaky1/DIMI-embedding-sts-matryoshka")

# Documents to search through
documents = [
    "الذكاء الاصطناعي يغير العالم بسرعة",
    "التكنولوجيا الحديثة تؤثر على حياتنا اليومية",
    "الطقس اليوم مشمس ودرجة الحرارة مناسبة",
    "كرة القدم هي الرياضة الأكثر شعبية في العالم",
    "الطبخ المنزلي أفضل من الطعام الجاهز",
    "البرمجة مهارة مهمة في العصر الحديث"
]

# Query
query = "التقنيات الجديدة وتأثيرها"

# Encode documents and query
doc_embeddings = model.encode(documents)
query_embedding = model.encode([query])

# Calculate similarities
similarities = model.similarity(query_embedding, doc_embeddings)[0].cpu().numpy()

# Get top results
top_indices = np.argsort(similarities)[::-1]

print(f"Query: {query}\n")
print("Most similar documents:")
for i, idx in enumerate(top_indices[:3]):
    print(f"{i+1}. {documents[idx]} (similarity: {similarities[idx]:.4f})")

Text Classification Example

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer("AhmedZaky1/DIMI-embedding-sts-matryoshka")

# Category examples
categories = {
    "رياضة": ["كرة القدم مباراة مثيرة", "السباحة رياضة ممتعة", "الجري يحسن الصحة"],
    "تكنولوجيا": ["الذكاء الاصطناعي متطور", "البرمجة مهارة مهمة", "الهواتف الذكية"],
    "طعام": ["الطبخ المنزلي لذيذ", "المطاعم الشعبية", "الحلويات العربية"]
}

# Encode category examples
category_embeddings = {}
for category, examples in categories.items():
    embeddings = model.encode(examples)
    category_embeddings[category] = np.mean(embeddings, axis=0)

# Classify new text
new_text = "الفريق فاز بالمباراة بصعوبة"
new_embedding = model.encode([new_text])

# Find most similar category
similarities = {}
for category, cat_embedding in category_embeddings.items():
    similarity = cosine_similarity([new_embedding[0]], [cat_embedding])[0][0]
    similarities[category] = similarity

# Get prediction
predicted_category = max(similarities, key=similarities.get)
print(f"Text: {new_text}")
print(f"Predicted category: {predicted_category}")
print(f"Confidence: {similarities[predicted_category]:.4f}")

Evaluation

Metrics

Semantic Similarity

  • Dataset: arabic-sts-dev
  • Evaluated with: EmbeddingSimilarityEvaluator
  • pearson_cosine: 0.9649
  • spearman_cosine: 0.9595
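
These numbers come from EmbeddingSimilarityEvaluator on the arabic-sts-dev split. The split itself is not bundled with this card, but the sketch below shows how such an evaluation is typically run; the three pairs and gold scores are hypothetical placeholders, not real dev data:

from sentence_transformers import SentenceTransformer, SimilarityFunction
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("AhmedZaky1/DIMI-embedding-sts-matryoshka")

# Hypothetical pairs with gold similarity scores in [0, 1];
# substitute the real arabic-sts-dev data to reproduce the reported numbers
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["الطقس اليوم مشمس وجميل", "ديترويت مؤهلة للحماية من الإفلاس", "البرمجة مهارة مهمة"],
    sentences2=["السماء صافية والشمس مشرقة", "بورصة نيويورك ستعيد فتحها الأربعاء", "الطبخ المنزلي لذيذ"],
    scores=[0.85, 0.10, 0.05],
    main_similarity=SimilarityFunction.COSINE,
    name="arabic-sts-dev",
)
print(evaluator(model))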

Training Details

Training Dataset

  • Dataset: Unnamed Dataset
  • Size: 27,788 training samples
  • Columns: sentence_0, sentence_1, and label

Approximate statistics based on the first 1000 samples:

  • sentence_0 (string): min 4 tokens, mean 27.82 tokens, max 143 tokens
  • sentence_1 (string): min 4 tokens, mean 27.67 tokens, max 148 tokens
  • label (float): min 0.0, mean 0.53, max 1.0

Sample data:

Each row is sentence_0 / sentence_1 / label:

  • "A man is walking along a path through wilderness." / "A man is walking down a road." / 0.5
  • "China's online population rises to 618 mln" / "China's troubled Xinjiang hit by more violence" / 0.08
  • "وجد الباحثون فقط تجاويف فارغة و نسيج ندب حيث كانت الأورام" / "لم يتم اكتشاف أي أورام، بل تم العثور على تجاويف فارغة ونسيج ندبة في مكانها." / 0.8
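
The card does not include the training script. Given the (sentence_0, sentence_1, label) columns and the MatryoshkaLoss citation below, a plausible setup wraps a pairwise similarity loss such as CoSENTLoss in MatryoshkaLoss; the sketch below is a hypothetical reconstruction, and the inner loss and dimension schedule are assumptions:

from sentence_transformers import SentenceTransformer, losses

# Hypothetical reconstruction of the training losses; the actual recipe,
# inner loss, and dimension schedule used for this model are not published.
base = SentenceTransformer("AhmedZaky1/arabic-bert-nli-matryoshka")
inner_loss = losses.CoSENTLoss(base)
train_loss = losses.MatryoshkaLoss(base, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])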

Framework Versions

  • Python: 3.12.7
  • Sentence Transformers: 3.3.1
  • Transformers: 4.51.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.4.0
  • Datasets: 3.3.2
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
