Tajik Word2Vec Word Embedding Model

This repository contains a pretrained Word2Vec (Skip-gram) model for the Tajik language, trained with negative sampling on a corpus of about 33.5 million tokens drawn from Tajik books and articles.

The model is suitable for use in various NLP tasks such as:

Text classification
Semantic similarity detection
Building classifiers and other downstream models
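A common way to feed word embeddings into a classifier is to average the vectors of a document's tokens. The sketch below illustrates that idea under stated assumptions: `average_vector` and `toy_vectors` are hypothetical names, the vectors are toy 2-dimensional values rather than real model output, and tokens missing from the vocabulary are simply skipped.

```python
from typing import Optional, Sequence


def average_vector(tokens: Sequence[str], vectors) -> Optional[list]:
    """Average the vectors of the tokens that are in the vocabulary.

    `vectors` is anything that supports `token in vectors` and
    `vectors[token]`, such as a plain dict or gensim's `model.wv`.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # every token was out of vocabulary
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 2-dimensional vectors standing in for model.wv (hypothetical values).
toy_vectors = {"об": [1.0, 0.0], "мактаб": [0.0, 1.0]}
doc_features = average_vector(["об", "мактаб", "номаълум"], toy_vectors)
print(doc_features)  # [0.5, 0.5]
```

With the real model, passing `model.wv` as `vectors` should yield a 300-dimensional feature list per document, which can then be given to any standard classifier.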

The model is licensed under the MIT License, which permits free use in both research and commercial applications.

📊 Model Overview

Parameter	Value
Model Type	Word2Vec (Skip-gram)
Vector Size	300
Vocabulary Size	145,232
Context Window	5
Min Word Count	≥ 5
Supports OOV	❌ No
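For readers who want to train a comparable model on their own corpus, the table maps onto gensim's `Word2Vec` arguments roughly as follows. This is a sketch, not the authors' training script: `TRAIN_PARAMS` and `train_skipgram` are hypothetical names, the argument names assume gensim 4.x, and the number of negative samples is not stated in this README, so gensim's default of 5 is used as a placeholder.

```python
# Hyperparameters from the table, using gensim 4.x argument names.
TRAIN_PARAMS = {
    "vector_size": 300,  # Vector Size
    "window": 5,         # Context Window
    "min_count": 5,      # Min Word Count: rarer words are dropped
    "sg": 1,             # 1 = Skip-gram, 0 = CBOW
    "hs": 0,             # no hierarchical softmax, so negative sampling is used
    "negative": 5,       # ASSUMPTION: gensim default, not stated in this README
}


def train_skipgram(sentences, **overrides):
    """Train a Word2Vec model on an iterable of token lists."""
    # Imported lazily so the parameter table can be inspected without gensim.
    from gensim.models import Word2Vec

    params = {**TRAIN_PARAMS, **overrides}
    return Word2Vec(sentences=sentences, **params)
```

On a small test corpus, pass `min_count=1` as an override; otherwise most words fall below the frequency threshold and the vocabulary ends up empty.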

📚 Training Corpus

Books (Total: 99)

Programming: 6
History: 4
Religion: 12
Scientific: 3
Children's literature: 6
Prose: 19
Poetry: 21
Textbooks: 28

Articles (Total: 134,497)

Asia-Plus: 20,471
Khovar: 21,557
Ovozi Tojik: 7,495
Farazh: 4,679
Wikipedia: 80,295

Total Corpus Statistics

Documents: 134,596
Tokens: ~33.5 million
Unique Lemmas: 649,308

🧪 Comparison with Meta FastText

We evaluated this Word2Vec model against the Meta FastText model on a semantic similarity task, measured by Spearman correlation:

Model	Spearman Correlation	OOV Support
FastText (Meta)	0.703	Yes
Word2Vec (ours)	0.558	❌ No

While Word2Vec shows a lower overall correlation than FastText, it performs well on frequent words and basic lexical relationships.

🔍 Example Similar Words

Word	Nearest Neighbors (Word2Vec)
кӯдак	кӯдакон(0.68), кӯдакро(0.67), кӯдаки(0.64), кўдак(0.64), модару(0.63)
муаллим	омӯзгор(0.66), муаллима(0.62), муаллимҳо(0.61), муаллимамон(0.60), синфамон(0.59)
об	оби(0.71), обро(0.70), обаш(0.59), олк(0.55), дебиташ(0.55)
мард	зан(0.74), мардро(0.57), зану(0.56), нафирам(0.56), занро(0.55)
мактаб	интернат(0.71), дабистон(0.66), интернати(0.64), литсей(0.63), мактабҳо(0.63)
савод	хатту(0.75), хату(0.70), саводро(0.69), хонданро(0.62), саводи(0.60)
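The scores in parentheses are presumably cosine similarities, which is what gensim's `most_similar` returns. As a rough illustration of how such a neighbor list is produced, here is a minimal sketch with toy 2-dimensional vectors; `cosine`, `nearest`, and the vector values are hypothetical and not taken from the model.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, topn=5):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# Toy vectors (hypothetical values), not real model output.
toy_vectors = {"мард": [1.0, 0.0], "зан": [0.8, 0.6], "об": [0.0, 1.0]}
neighbors = nearest("мард", toy_vectors)  # "зан" ranks ahead of "об"
```

Because cosine similarity depends only on direction, inflected forms that occur in similar contexts (for example кӯдак and кӯдакон) tend to rank near each other.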

🧩 Handling OOV (Out-of-Vocabulary) Words

Unlike FastText, Word2Vec does not support vector generation for unknown (OOV) words. If a word was not present during training, the model cannot produce its embedding.

Example: the word меҳмонамон is not present in the vocabulary of the Word2Vec model.

If you need support for rare or unseen words, we recommend using the Meta FastText version for Tajik.
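If you stay with this Word2Vec model, it helps to check vocabulary membership before indexing, since looking up an unknown word in gensim raises a `KeyError`. The helpers below are a minimal sketch: `lookup` and `split_by_vocab` are hypothetical names, and the toy dictionary stands in for `model.wv`.

```python
def lookup(word, vectors, default=None):
    """Return the vector for `word`, or `default` if it is out of vocabulary."""
    return vectors[word] if word in vectors else default


def split_by_vocab(tokens, vectors):
    """Separate tokens into those the model knows and those it does not."""
    known = [t for t in tokens if t in vectors]
    unknown = [t for t in tokens if t not in vectors]
    return known, unknown


# Toy stand-in for model.wv (hypothetical contents and values).
toy_vectors = {"меҳмон": [0.1, 0.2]}
known, unknown = split_by_vocab(["меҳмон", "меҳмонамон"], toy_vectors)
# known == ["меҳмон"], unknown == ["меҳмонамон"]
```

Both helpers rely only on `in` and `[]`, which gensim's `KeyedVectors` also supports, so `model.wv` can be passed in place of the dictionary.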

📌 Features for the Tajik Language

Our model performs well on:

Frequent lemmas: e.g., "об", "мактаб", "савод"
Semantic similarity: e.g., "мард" ↔ "зан", "муаллим" ↔ "омӯзгор"
Simple forms: especially effective on base words without complex morphology

💡 Usage Example

from gensim.models import Word2Vec

model = Word2Vec.load("tajik_word2vec.model")

vector = model.wv["падар"]  # Get vector for a word
similar_words = model.wv.most_similar("модар")  # Find similar words

🗂️ Files Included

File	Description
`tajik_word2vec.model`	Gensim Word2Vec model file
`*.npy` files	Supporting NumPy arrays for vectors
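gensim stores large arrays in separate `.npy` files next to the main model file and expects to find them there when loading. The sketch below lists the arrays that sit beside a given model file. It assumes the arrays are named with the model filename as a prefix, which is how gensim usually names them but is not confirmed by this README; `sidecar_arrays` is a hypothetical helper.

```python
from pathlib import Path


def sidecar_arrays(model_path):
    """List the .npy files in the model's directory that share its filename prefix."""
    model_path = Path(model_path)
    pattern = model_path.name + "*.npy"
    return sorted(p.name for p in model_path.parent.glob(pattern))
```

An empty result for a model that was saved with separate arrays suggests the `.npy` files were not downloaded into the same directory, in which case `Word2Vec.load` is likely to fail.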

📚 Citation

If you use this model, please cite:

@misc{ArabovMK_Tajik_Word2Vec,
  author = {ArabovMK},
  title = {Tajik Word2Vec Word Embeddings},
  year = 2025,
  publisher = {Hugging Face},
  url = {https://huggingface.co/ArabovMK/tajik-word2vec-model}
}

Last updated: 2025-05-10