---
language:
- en
- kha
license: mit
library_name: fasttext
tags:
- embeddings
- word-embeddings
- khasi
- multilingual
- northeast-india
- low-resource
- Meghalaya
datasets:
- custom
metrics:
- cosine_similarity
model-index:
- name: Badnyal/khasi-english-embeddings
  results:
  - task:
      type: word-similarity
      name: Cross-lingual Word Similarity
    dataset:
      name: Khasi-English Parallel Corpus
      type: custom
    metrics:
    - type: cosine_similarity
      value: 0.29
      name: Cross-lingual Similarity Score
---
# Khasi-English Word Embeddings
## Model Description
This model provides the first comprehensive word embeddings for the Khasi language, trained on a bilingual Khasi-English corpus. Khasi is an Austroasiatic language of the Mon-Khmer branch, spoken primarily in Meghalaya, Northeast India.
## Model Architecture
- **Model Type**: FastText (Skip-gram)
- **Embedding Dimension**: 300
- **Vocabulary Size**: 38,220 tokens
- **Loss Function**: Hierarchical Softmax
- **Context Window**: 5 words
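
The hyperparameters listed above map onto the `fasttext` Python API roughly as follows. This is a minimal sketch rather than the author's actual training script: the corpus path `corpus.txt` is a hypothetical placeholder, the epoch count is taken from the Performance Metrics table, and any setting not stated in this card (minimum word count, learning rate, subword n-gram range) is left at the library default.

```python
# Hyperparameters stated in this model card, expressed as keyword
# arguments for fasttext.train_unsupervised.
TRAIN_PARAMS = {
    "model": "skipgram",  # Skip-gram architecture
    "dim": 300,           # embedding dimension
    "ws": 5,              # context window size
    "loss": "hs",         # hierarchical softmax
    "epoch": 20,          # training epochs
}


def train_embeddings(corpus_path="corpus.txt"):
    """Train a model on a plain-text corpus (one sentence per line).

    `corpus_path` is a hypothetical placeholder, not a file shipped with
    this repository.
    """
    import fasttext  # imported lazily so the config can be inspected without it

    return fasttext.train_unsupervised(corpus_path, **TRAIN_PARAMS)
```

Keeping the settings in a single dictionary makes it easy to record them alongside a trained model or to vary one value at a time when comparing runs.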
## Training Data
The model was trained on a curated corpus containing:
- **63,909 Khasi sentences** from diverse sources
- **65,239 English sentences** for cross-lingual alignment
- **65,241 parallel translation pairs**
### Data Sources
- Clean Khasi text corpus
- Processed historical documents
- Bilingual translation datasets
- Cultural and administrative texts
## Performance Metrics
| Metric | Value |
|--------|-------|
| Vocabulary Size | 38,220 tokens |
| Cross-lingual Similarity (cosine) | 0.290 |
| Training Epochs | 20 |
| Embedding Dimension | 300 |
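
The card does not say how the cross-lingual similarity score was computed. One plausible reading, shown here purely as an assumption, is the mean cosine similarity between the vectors of each translation pair. The helper name `mean_pair_similarity` and the `get_vector` callback are illustrative; with the released model the callback could be `model.get_word_vector`.

```python
import numpy as np


def mean_pair_similarity(pairs, get_vector):
    """Average cosine similarity over (source_word, target_word) pairs.

    Assumption: this is one possible definition of the score, not
    necessarily the one used to obtain the value reported in the table.
    """
    scores = []
    for source_word, target_word in pairs:
        a = np.asarray(get_vector(source_word), dtype=float)
        b = np.asarray(get_vector(target_word), dtype=float)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom == 0.0:
            continue  # skip pairs that contain an all-zero vector
        scores.append(float(a @ b / denom))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy vectors standing in for real embeddings
toy = {"x": [1.0, 0.0], "y": [0.0, 1.0], "z": [1.0, 0.0]}
score = mean_pair_similarity([("x", "y"), ("x", "z")], toy.__getitem__)
```

In the toy example one pair is orthogonal and one is identical, so the mean lands halfway between 0 and 1.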
## Usage
### Loading the Model
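```python
import fasttext

# Load the trained binary model (path is relative to the working directory)
model = fasttext.load_model('khasi_embeddings.bin')

# Look up the 300-dimensional vector for a language-tagged Khasi word
vector = model.get_word_vector('__khasi__ ka')

# Retrieve the 10 nearest neighbours as (similarity, word) tuples
similar_words = model.get_nearest_neighbors('__khasi__ ka', k=10)
```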
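
If the model file is not already on disk, it can be fetched from the Hub first. The sketch below assumes the repository stores the binary under the same name used in the loading example (`khasi_embeddings.bin`); that file name is an assumption and should be checked against the repository's file list. The function name `load_from_hub` is illustrative.

```python
REPO_ID = "Badnyal/khasi-english-embeddings"
MODEL_FILENAME = "khasi_embeddings.bin"  # assumed file name in the repository


def load_from_hub(repo_id=REPO_ID, filename=MODEL_FILENAME):
    """Download the model file (cached locally) and load it with fastText."""
    import fasttext
    from huggingface_hub import hf_hub_download

    # hf_hub_download returns the local path of the cached file
    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return fasttext.load_model(local_path)
```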
### Cross-lingual Queries
```python
from sklearn.metrics.pairwise import cosine_similarity

# Reuses `model` from the loading example above.
# Look up a Khasi word and the English word it is compared against
khasi_word = model.get_word_vector('__khasi__ bad')
english_word = model.get_word_vector('__english__ and')

# cosine_similarity expects 2-D inputs, so wrap each vector in a list
similarity = cosine_similarity([khasi_word], [english_word])[0][0]
```
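
If scikit-learn is not available, the same cosine similarity can be computed with NumPy alone. The helper below is a small sketch; its name `cosine` is illustrative, and returning 0.0 for an all-zero vector is a convention chosen here, not behaviour defined by fastText.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return float(u @ v / denom)
```

With the vectors from the query above, `cosine(khasi_word, english_word)` should agree with the scikit-learn result up to floating-point error.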
## Language Coverage
### Khasi Language Features
- Standard Latin-script Khasi orthography
- Morphological variations
- Cultural terminology
- Administrative vocabulary
### Cross-lingual Capabilities
- Khasi-English semantic alignment
- Translation assistance
- Cultural concept mapping
## Limitations
- **Cross-lingual alignment**: Limited by structural differences between Khasi and English
- **Domain coverage**: Primarily trained on formal/administrative texts
- **Dialectal variations**: May not capture all regional Khasi variants
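
Given the domain and dialect limitations above, it can help to check how much of a target text falls inside the model's vocabulary before relying on the embeddings. The sketch below is an illustration under simple assumptions: whitespace tokenisation and a `vocab` set that could be built from `set(model.words)`. The function name `vocab_coverage` is hypothetical, and the language-tag prefixes used in the Usage examples are not handled here.

```python
def vocab_coverage(text, vocab):
    """Return the fraction of whitespace-separated tokens found in `vocab`."""
    tokens = text.split()
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocab)
    return known / len(tokens)


# Toy vocabulary standing in for set(model.words)
toy_vocab = {"ka", "bad", "u"}
coverage = vocab_coverage("ka bad ki", toy_vocab)
```

If subword n-grams were enabled during training (the fastText default for unsupervised models), vectors can still be composed for out-of-vocabulary words, but a low coverage figure suggests the text lies outside the training domain and results should be treated with caution.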
## Intended Use
This model is designed for:
- **Research**: Computational linguistics studies on Khasi
- **Language preservation**: Digital archiving and analysis
- **Educational tools**: Language learning applications
- **Cultural preservation**: Maintaining indigenous knowledge
## Ethical Considerations
This model was developed with respect for Khasi cultural heritage and language preservation goals. Users are encouraged to collaborate with Khasi language communities when deploying this model.
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{khasi-embeddings-2025,
title={Khasi-English Word Embeddings: First Comprehensive Embeddings for Khasi Language},
author={Badnyal},
year={2025},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/Badnyal/khasi-english-embeddings}}
}
```
## Acknowledgments
Special thanks to everyone contributing to the preservation of the indigenous languages of Northeast India.
## Contact
For questions, collaborations, or feedback regarding this model, please open an issue in the model repository.
---
*This model represents pioneering work in Khasi language processing and serves as a foundation for future research in Northeast Indian computational linguistics.*