Tajik Word2Vec Word Embedding Model

This repository contains a pretrained Word2Vec (Skip-gram) model for the Tajik language, trained with negative sampling on a corpus of roughly 33.5 million tokens of Tajik text.

The model is suitable for use in various NLP tasks such as:

  • Text classification
  • Semantic similarity detection
  • Building classifiers and other downstream models
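As an illustration of the classification use case, one common baseline is to average the word vectors of a document and feed the result to any standard classifier. The sketch below is not part of this repository: `document_vector` is a hypothetical helper name, `wv` stands for the loaded `model.wv`, and tokenization is assumed to have been done already.

```python
import numpy as np

def document_vector(wv, tokens, dim=300):
    """Average the vectors of in-vocabulary tokens.

    `wv` is anything that supports `word in wv` and `wv[word]`,
    such as gensim KeyedVectors. `dim` defaults to this model's
    vector size (300). Tokens missing from the vocabulary are skipped;
    if none are known, a zero vector is returned.
    """
    known = [wv[t] for t in tokens if t in wv]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)
```

The resulting fixed-length vectors can be stacked into a feature matrix. Averaging ignores word order, so it is best treated as a baseline.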

The model is released under the MIT License, which permits free use in both research and commercial applications.


📊 Model Overview

| Parameter | Value |
|---|---|
| Model Type | Word2Vec (Skip-gram) |
| Vector Size | 300 |
| Vocabulary Size | 145,232 |
| Context Window | 5 |
| Min Word Count | ≥ 5 |
| Supports OOV | ❌ No |
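For readers who want to train a comparable model on their own corpus, the parameters above map onto gensim (version 4 or later) roughly as follows. This is a sketch, not the original training script: the number of negative samples, epochs, and worker threads are not stated in this card and are assumptions, and `train_word2vec` is a hypothetical helper name.

```python
# Values marked "card" come from the Model Overview; the rest are assumptions.
TRAIN_PARAMS = {
    "vector_size": 300,  # card: Vector Size
    "window": 5,         # card: Context Window
    "min_count": 5,      # card: Min Word Count
    "sg": 1,             # card: Skip-gram (sg=0 would be CBOW)
    "hs": 0,             # no hierarchical softmax, so negative sampling is used
    "negative": 5,       # ASSUMPTION: number of negative samples is not stated
    "epochs": 5,         # ASSUMPTION: not stated
    "workers": 4,        # ASSUMPTION: depends on your machine
}

def train_word2vec(sentences, **overrides):
    """Train a Word2Vec model on an iterable of token lists."""
    # Imported here so the configuration can be inspected without gensim.
    from gensim.models import Word2Vec
    params = {**TRAIN_PARAMS, **overrides}
    return Word2Vec(sentences=sentences, **params)
```

Calling `train_word2vec(sentences, min_count=1)` on a small list of token lists is a quick way to try the setup before committing to a full corpus.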

📚 Training Corpus

Books (Total: 99)

  • Programming: 6
  • History: 4
  • Religion: 12
  • Scientific: 3
  • Children's literature: 6
  • Prose: 19
  • Poetry: 21
  • Textbooks: 28

Articles (Total: 134,497)

  • Asia-Plus: 20,471
  • Khovar: 21,557
  • Ovozi Tojik: 7,495
  • Farazh: 4,679
  • Wikipedia: 80,295

Total Corpus Statistics

  • Documents: 134,596
  • Tokens: ~33.5 million
  • Unique Lemmas: 649,308

🧪 Comparison with Meta FastText

We evaluated this Word2Vec model against the Meta FastText model on a semantic similarity task, scored by Spearman correlation:

| Model | Spearman Correlation | OOV Support |
|---|---|---|
| FastText (Meta) | 0.703 | Yes |
| Word2Vec (ours) | 0.558 | ❌ No |

Although this model's overall correlation is lower than FastText's, it performs well on frequent words and basic lexical relationships.


🔍 Example Similar Words

| Word | Nearest Neighbors (Word2Vec) |
|---|---|
| кӯдак | кӯдакон (0.68), кӯдакро (0.67), кӯдаки (0.64), кўдак (0.64), модару (0.63) |
| муаллим | омӯзгор (0.66), муаллима (0.62), муаллимҳо (0.61), муаллимамон (0.60), синфамон (0.59) |
| об | оби (0.71), обро (0.70), обаш (0.59), олк (0.55), дебиташ (0.55) |
| мард | зан (0.74), мардро (0.57), зану (0.56), нафирам (0.56), занро (0.55) |
| мактаб | интернат (0.71), дабистон (0.66), интернати (0.64), литсей (0.63), мактабҳо (0.63) |
| савод | хатту (0.75), хату (0.70), саводро (0.69), хонданро (0.62), саводи (0.60) |

🧩 Handling OOV (Out-of-Vocabulary) Words

Unlike FastText, Word2Vec does not support vector generation for unknown (OOV) words. If a word was not present during training, the model cannot produce its embedding.

Example: the word меҳмонамон is not present in the vocabulary of the Word2Vec model.

If you need support for rare or unseen words, we recommend using the Meta FastText version for Tajik.
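In practice, indexing `model.wv` with an unknown word raises a `KeyError`, so it helps to test membership first. The sketch below shows a guarded lookup and a simple way to measure how much of your own text falls outside the vocabulary, which can inform the choice between this model and FastText. `lookup` and `oov_rate` are hypothetical helper names, and `wv` stands for the loaded `model.wv`.

```python
def lookup(wv, word):
    """Return the vector for `word`, or None if it is out of vocabulary."""
    if word in wv:  # gensim KeyedVectors supports membership tests
        return wv[word]
    return None

def oov_rate(wv, tokens):
    """Fraction of tokens that have no vector in the model (0.0 if empty)."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in wv)
    return missing / len(tokens)
```

For example, `lookup(model.wv, "меҳмонамон")` returns `None` for this model instead of raising an exception.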


📌 Features for the Tajik Language

Our model performs well on:

  • Frequent lemmas: e.g., "об", "мактаб", "савод"
  • Semantic similarity: e.g., "мард" ↔ "зан", "муаллим" ↔ "омӯзгор"
  • Simple forms: especially effective on base words without complex morphology

💡 Usage Example

from gensim.models import Word2Vec

model = Word2Vec.load("tajik_word2vec.model")

vector = model.wv["падар"]  # Get vector for a word
similar_words = model.wv.most_similar("модар")  # Find similar words

🗂️ Files Included

| File | Description |
|---|---|
| tajik_word2vec.model | Gensim Word2Vec model file |
| *.npy files | Supporting NumPy arrays for vectors |
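gensim stores large arrays in separate `.npy` files and expects them in the same directory as the model file when loading, so all files should be downloaded together. The sketch below lists the arrays that sit next to a model file; `companion_arrays` is a hypothetical helper, and the `<model file name>.<attribute>.npy` naming pattern is an assumption based on gensim's usual convention rather than something stated in this card.

```python
from pathlib import Path

def companion_arrays(model_path):
    """List the .npy files stored alongside a gensim model file.

    ASSUMPTION: they are named '<model file name>.<attribute>.npy',
    for example 'tajik_word2vec.model.wv.vectors.npy'.
    """
    model_path = Path(model_path)
    pattern = model_path.name + ".*.npy"
    return sorted(model_path.parent.glob(pattern))
```

An empty result from `companion_arrays("tajik_word2vec.model")` suggests the download is incomplete or that the files were saved under a different naming scheme.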

📚 Citation

If you use this model, please cite:

@misc{ArabovMK_Tajik_Word2Vec,
  author = {ArabovMK},
  title = {Tajik Word2Vec Word Embeddings},
  year = 2025,
  publisher = {Hugging Face},
  url = {https://huggingface.co/ArabovMK/tajik-word2vec-model}
}

Last updated: 2025-05-10
