Tajik Word2Vec Word Embedding Model
This repository contains a pretrained Word2Vec (Skip-gram) model for the Tajik language, trained with Negative Sampling on a corpus of roughly 33.5 million tokens of Tajik text.
The model is suitable for use in various NLP tasks such as:
- Text classification
- Semantic similarity detection
- Building classifiers and other downstream models
Licensed under the MIT License, which allows free usage in both research and commercial applications.
📊 Model Overview
Parameter | Value |
---|---|
Model Type | Word2Vec (Skip-gram) |
Vector Size | 300 |
Vocabulary Size | 145,232 |
Context Window | 5 |
Min Word Count | ≥ 5 |
Supports OOV | ❌ No |
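The settings in the table map onto gensim's `Word2Vec` constructor roughly as shown below. This is a sketch, not the author's training script: the value `negative=5` (gensim's default number of negative samples), `workers=4`, and the helper name `train_tajik_word2vec` are assumptions, since this card states only that Negative Sampling was used.

```python
# Hyperparameters from the Model Overview table, written as
# gensim 4.x keyword arguments.
W2V_PARAMS = {
    "vector_size": 300,  # Vector Size
    "window": 5,         # Context Window
    "min_count": 5,      # Min Word Count
    "sg": 1,             # 1 = Skip-gram (0 would be CBOW)
    "hs": 0,             # no hierarchical softmax
    "negative": 5,       # ASSUMPTION: the number of negative samples is not stated
}


def train_tajik_word2vec(sentences, workers=4):
    """Train a model on an iterable of tokenized sentences (lists of str).

    Hypothetical helper. gensim is imported inside the function so the
    parameter dictionary above can be inspected without gensim installed.
    """
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, workers=workers, **W2V_PARAMS)
```

Because `min_count` is 5, any word seen fewer than five times in the corpus is dropped from the vocabulary, which is one reason the vocabulary is much smaller than the number of unique lemmas.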
📚 Training Corpus
Books (Total: 99)
- Programming: 6
- History: 4
- Religion: 12
- Scientific: 3
- Children's literature: 6
- Prose: 19
- Poetry: 21
- Textbooks: 28
Articles (Total: 134,497)
- Asia-Plus: 20,471
- Khovar: 21,557
- Ovozi Tojik: 7,495
- Farazh: 4,679
- Wikipedia: 80,295
Total Corpus Statistics
- Documents: 134,596
- Tokens: ~33.5 million
- Unique Lemmas: 649,308
🧪 Comparison with Meta FastText
We compared this Word2Vec model with the Meta FastText model on a semantic similarity task, scoring each model by its Spearman correlation:
Model | Spearman Correlation | OOV Support |
---|---|---|
FastText (Meta) | 0.703 | Yes |
Word2Vec (ours) | 0.558 | ❌ No |
While Word2Vec shows lower overall correlation compared to FastText, it performs well on frequent words and basic lexical relationships.
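For readers unfamiliar with the metric, Spearman correlation measures how well the model's similarity ranking of word pairs agrees with a reference ranking. The pure-Python sketch below shows the computation only; the function names and the four sample scores are invented for illustration and are not the evaluation data behind the table.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up illustration data: reference scores and model cosine similarities
# for the same four word pairs.
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.74, 0.66, 0.20, 0.31]
rho = spearman(human_scores, model_scores)
```

A value of 1.0 means the two rankings agree exactly; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead of hand-written code.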
🔍 Example Similar Words
Word | Nearest Neighbors (Word2Vec) |
---|---|
кӯдак | кӯдакон(0.68), кӯдакро(0.67), кӯдаки(0.64), кўдак(0.64), модару(0.63) |
муаллим | омӯзгор(0.66), муаллима(0.62), муаллимҳо(0.61), муаллимамон(0.60), синфамон(0.59) |
об | оби(0.71), обро(0.70), обаш(0.59), олк(0.55), дебиташ(0.55) |
мард | зан(0.74), мардро(0.57), зану(0.56), нафирам(0.56), занро(0.55) |
мактаб | интернат(0.71), дабистон(0.66), интернати(0.64), литсей(0.63), мактабҳо(0.63) |
савод | хатту(0.75), хату(0.70), саводро(0.69), хонданро(0.62), саводи(0.60) |
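The numbers in parentheses are presumably cosine similarities, which is what gensim's `most_similar` reports. As a minimal sketch of how such a neighbor list is produced, the code below ranks words by cosine similarity; the three-dimensional vectors and the function names are invented for illustration and are unrelated to the real 300-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(vectors, query, topn=5):
    """Rank every other word in `vectors` by cosine similarity to `query`."""
    scored = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "мард": [1.0, 0.2, 0.0],
    "зан": [0.9, 0.3, 0.1],
    "об": [0.0, 0.1, 1.0],
}
neighbors = nearest_neighbors(toy_vectors, "мард", topn=2)
```

Note that many neighbors in the table are inflected forms of the query word (for example "кӯдакон" for "кӯдак"), which is expected when each surface form gets its own vector.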
🧩 Handling OOV (Out-of-Vocabulary) Words
Unlike FastText, Word2Vec does not support vector generation for unknown (OOV) words. If a word was not present during training, the model cannot produce its embedding.
Example: the word меҳмонамон is not present in the vocabulary of the Word2Vec model.
If you need support for rare or unseen words, we recommend using the Meta FastText version for Tajik.
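If you stay with this model, it helps to guard lookups so that a missing word does not raise `KeyError`. The sketch below assumes only that the vector store supports `word in wv` and `wv[word]`, which holds for gensim's `KeyedVectors` (`model.wv`) and for a plain dict; the helper names and the toy vocabulary are hypothetical.

```python
def safe_vector(wv, word, default=None):
    """Return the vector for `word`, or `default` if it is out of vocabulary.

    `wv` can be any object supporting `word in wv` and `wv[word]`, which
    includes gensim's KeyedVectors (`model.wv`) and a plain dict.
    """
    return wv[word] if word in wv else default


def split_known_unknown(wv, tokens):
    """Partition tokens into (in-vocabulary, out-of-vocabulary) lists."""
    known = [t for t in tokens if t in wv]
    unknown = [t for t in tokens if t not in wv]
    return known, unknown


# Stand-in vocabulary for illustration; the real object would be `model.wv`.
toy_wv = {"меҳмон": [0.1, 0.2], "мактаб": [0.3, 0.4]}
known, unknown = split_known_unknown(toy_wv, ["мактаб", "меҳмонамон"])
```

Counting the `unknown` tokens on a sample of your own data is a quick way to judge whether the OOV rate is low enough for this model or whether FastText is the better fit.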
📌 Features for the Tajik Language
Our model performs well on:
- Frequent lemmas: e.g., "об", "мактаб", "савод"
- Semantic similarity: e.g., "мард" ↔ "зан", "муаллим" ↔ "омӯзгор"
- Simple forms: especially effective on base words without complex morphology
💡 Usage Example
from gensim.models import Word2Vec
model = Word2Vec.load("tajik_word2vec.model")
vector = model.wv["падар"] # Get vector for a word
similar_words = model.wv.most_similar("модар") # Find similar words
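As a small extension of the usage example, the helper below prints neighbors in the same `word(score)` style used in the table of similar words and reports missing words instead of raising an error. The function name `format_neighbors` is hypothetical; it relies only on `most_similar` and the `in` operator of gensim's `KeyedVectors`.

```python
def format_neighbors(wv, word, topn=5):
    """Render neighbors as 'word(score), ...', or report a missing word."""
    if word not in wv:
        return f"{word}: not in vocabulary"
    pairs = wv.most_similar(word, topn=topn)
    return ", ".join(f"{neighbor}({score:.2f})" for neighbor, score in pairs)


# With the model loaded as in the usage example, this would be called as:
#   print(format_neighbors(model.wv, "мактаб"))
```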
🗂️ Files Included
File | Description |
---|---|
tajik_word2vec.model | Gensim Word2Vec model file |
*.npy files | Supporting NumPy arrays for vectors |
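Gensim stores large arrays in separate `.npy` files and expects to find them next to the main model file when loading, so all files should be downloaded into the same directory. The check below is a minimal sketch of that precondition; the function name is hypothetical, and it only looks for any `.npy` file rather than assuming specific file names.

```python
from pathlib import Path


def missing_model_files(directory, model_name="tajik_word2vec.model"):
    """Return a list of problems that would likely prevent loading."""
    folder = Path(directory)
    problems = []
    if not (folder / model_name).is_file():
        problems.append(f"{model_name} not found")
    if not any(folder.glob("*.npy")):
        problems.append("no .npy array files found next to the model")
    return problems
```

An empty list means the expected files are present; it does not verify that they are complete or uncorrupted.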
📚 Citation
If you use this model, please cite:
@misc{ArabovMK_Tajik_Word2Vec,
author = {ArabovMK},
title = {Tajik Word2Vec Word Embeddings},
year = 2025,
publisher = {Hugging Face},
url = {https://huggingface.co/ArabovMK/tajik-word2vec-model}
}
Last updated: 2025-05-10