---
language:
- en
- kha
license: mit
library_name: fasttext
tags:
- embeddings
- word-embeddings
- khasi
- multilingual
- northeast-india
- low-resource
- Meghalaya
datasets:
- custom
metrics:
- cosine_similarity
model-index:
- name: Badnyal/khasi-english-embeddings
  results:
  - task:
      type: word-similarity
      name: Cross-lingual Word Similarity
    dataset:
      name: Khasi-English Parallel Corpus
      type: custom
    metrics:
    - type: cosine_similarity
      value: 0.29
      name: Cross-lingual Similarity Score
---

# Khasi-English Word Embeddings

## Model Description

This model provides the first comprehensive word embeddings for the Khasi language, trained on a bilingual Khasi-English corpus. Khasi is an Austroasiatic language of the Mon-Khmer branch, spoken primarily in Meghalaya, Northeast India.

## Model Architecture

- **Model Type**: FastText (Skip-gram)
- **Embedding Dimension**: 300
- **Vocabulary Size**: 38,220 tokens
- **Loss Function**: Hierarchical Softmax
- **Context Window**: 5 words
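
The settings listed above correspond to keyword arguments of `fasttext.train_unsupervised`. The following is a minimal sketch of that mapping, not the author's training script: the corpus path `corpus.txt` and the helper names are hypothetical, the epoch count is taken from the metrics table in this card, and every other option is assumed to stay at the fastText default.

```python
def build_training_args(corpus_path):
    """Collect the settings listed in this card as fastText keyword arguments."""
    return {
        "input": corpus_path,  # hypothetical path to a plain-text training file
        "model": "skipgram",   # Model Type: FastText (Skip-gram)
        "dim": 300,            # Embedding Dimension
        "ws": 5,               # Context Window
        "loss": "hs",          # Hierarchical Softmax
        "epoch": 20,           # Training Epochs (from the metrics table)
    }


def train(corpus_path, output_path):
    """Train and save a model; needs the `fasttext` package and a real corpus."""
    import fasttext  # imported lazily so the settings can be inspected without it

    model = fasttext.train_unsupervised(**build_training_args(corpus_path))
    model.save_model(output_path)
    return model


# Inspect the settings without starting a training run
args = build_training_args("corpus.txt")
print(args)
```

Calling `train("corpus.txt", "khasi_embeddings.bin")` would start an actual run; how the original corpus was tokenized or tagged by language is not described here, so results from such a run need not match the released model.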

## Training Data

The model was trained on a curated corpus containing:

- **63,909 Khasi sentences** from diverse sources
- **65,239 English sentences** for cross-lingual alignment
- **65,241 parallel translation pairs**

### Data Sources

- Clean Khasi text corpus
- Processed historical documents
- Bilingual translation datasets
- Cultural and administrative texts

## Performance Metrics

| Metric | Value |
|--------|-------|
| Vocabulary Size | 38,220 tokens |
| Cross-lingual Similarity (cosine) | 0.290 |
| Training Epochs | 20 |
| Embedding Dimension | 300 |
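
This card does not state how the cross-lingual similarity score was computed. One common way to obtain a figure of this kind is to average cosine similarity over known translation pairs. The sketch below illustrates that idea on made-up toy vectors; it is an assumption about the general approach, not a reconstruction of the author's evaluation, and `mean_pair_similarity` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pair_similarity(pairs):
    """Average cosine similarity over (source_vector, target_vector) pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy data: one perfectly aligned pair and one orthogonal pair
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]
score = mean_pair_similarity(toy_pairs)
print(round(score, 3))
```

With a loaded model, each pair would hold the vectors of a Khasi word and its English translation.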

## Usage

### Loading the Model

```python
import fasttext

# Load the model
model = fasttext.load_model('khasi_embeddings.bin')

# Get word vector
vector = model.get_word_vector('__khasi__ ka')

# Find similar words
similar_words = model.get_nearest_neighbors('__khasi__ ka', k=10)
```

### Cross-lingual Queries

```python
from sklearn.metrics.pairwise import cosine_similarity

# Look up one Khasi and one English vector (`model` is the model loaded above)
khasi_word = model.get_word_vector('__khasi__ bad')
english_word = model.get_word_vector('__english__ and')

# Cosine similarity between the two vectors
similarity = cosine_similarity([khasi_word], [english_word])[0][0]
```
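
The pairwise comparison above extends to ranking several candidate words against one query vector, which is the basic operation behind a translation lookup. Here is a minimal sketch in plain Python so that it runs without the model: the three-dimensional vectors are made up, and `rank_candidates` is a hypothetical helper, not part of fastText. With the real model, the vectors would come from `model.get_word_vector`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidates):
    """Return (word, score) pairs sorted from most to least similar."""
    scored = [(word, cosine(query_vec, vec)) for word, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up vectors for illustration only; real ones have 300 dimensions
query = [0.9, 0.1, 0.0]              # stands in for an English query word
candidates = {
    "bad": [1.0, 0.0, 0.0],          # stands in for a Khasi candidate
    "ka": [0.0, 1.0, 0.0],           # stands in for another Khasi candidate
}
ranking = rank_candidates(query, candidates)
for word, score in ranking:
    print(f"{word}\t{score:.3f}")
```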

## Language Coverage

### Khasi Language Features

- Standard Khasi orthography (Latin script)
- Morphological variations
- Cultural terminology
- Administrative vocabulary

### Cross-lingual Capabilities

- Khasi-English semantic alignment
- Translation assistance
- Cultural concept mapping

## Limitations

- **Cross-lingual alignment**: Limited by structural differences between Khasi and English; the reported cross-lingual similarity score is 0.290
- **Domain coverage**: Primarily trained on formal/administrative texts
- **Dialectal variations**: May not capture all regional Khasi variants
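
One practical way to gauge the domain and dialect limits listed above is to measure how many tokens of a new text appear in the model vocabulary. This matters because fastText can still return a vector for an unseen word by composing character n-grams, so out-of-vocabulary input does not raise an error. The sketch below uses a toy vocabulary; `vocabulary_coverage` is a hypothetical helper. With a loaded model the set could be built from `model.words`, though whether its entries carry the language-tag prefix seen in the usage examples is not stated in this card.

```python
def vocabulary_coverage(tokens, vocabulary):
    """Return (fraction of tokens found in the vocabulary, list of missing tokens)."""
    if not tokens:
        return 0.0, []
    missing = [tok for tok in tokens if tok not in vocabulary]
    return 1.0 - len(missing) / len(tokens), missing


# Toy vocabulary and text, for illustration only
vocab = {"ka", "bad"}
tokens = "ka bad xyz ka".split()
coverage, missing = vocabulary_coverage(tokens, vocab)
print(f"coverage={coverage:.2f}, missing={missing}")
```

A low coverage value on a target text suggests the embeddings will rely heavily on subword composition there, and results should be checked more carefully.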

## Intended Use

This model is designed for:

- **Research**: Computational linguistics studies on Khasi
- **Language preservation**: Digital archiving and analysis
- **Educational tools**: Language learning applications
- **Cultural preservation**: Maintaining indigenous knowledge

## Ethical Considerations

This model was developed with respect for Khasi cultural heritage and language preservation goals. Users are encouraged to collaborate with Khasi language communities when deploying this model.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{khasi-embeddings-2025,
  title={Khasi-English Word Embeddings: First Comprehensive Embeddings for Khasi Language},
  author={Badnyal},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Badnyal/khasi-english-embeddings}}
}
```

## Acknowledgments

Special thanks to everyone contributing to the preservation of the indigenous languages of Northeast India.

## Contact

For questions, collaborations, or feedback regarding this model, please open an issue in the model repository.

---

*This model represents pioneering work in Khasi language processing and serves as a foundation for future research in Northeast Indian computational linguistics.*