---
language:
- en
- grt
tags:
- embeddings
- bilingual
- garo
- low-resource
license: cc-by-sa-4.0
datasets:
- private
model-index:
- name: GaroVec v1.0
  results: []
---
# GaroVec v1.0 — Hybrid English↔Garo Embeddings
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17083589.svg)](https://doi.org/10.5281/zenodo.17083589)
## Overview
**GaroVec v1.0** is the *first publicly documented Latin-script Garo embedding model*.
It combines:
- **FastText embeddings** (English + Garo)
- **Cross-lingual alignment** using Procrustes rotation
- **Frequency-based bilingual dictionary** for high-confidence word translations

This hybrid design provides both **semantic embeddings** and **direct dictionary lookups**, making it useful for cross-lingual tasks like translation support, lexicon building, and low-resource NLP research.
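
The extraction procedure behind the frequency-based dictionary is not spelled out in this card. As a rough illustration only, one simple approach counts how often each English word shares a sentence pair with each Garo word, then keeps a pairing when it is both frequent and dominant. The function name, the `min_count` and `min_share` thresholds, and the `g_*` placeholder tokens (which are not real Garo words) are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def build_frequency_dictionary(pairs, min_count=2, min_share=0.6):
    """Map each English word to the Garo word it co-occurs with most often.

    pairs: iterable of (english_sentence, garo_sentence) strings.
    min_count / min_share: hypothetical confidence thresholds.
    """
    cooc = defaultdict(Counter)  # English word -> counts of co-occurring Garo words
    en_freq = Counter()          # number of sentence pairs containing each English word
    for en_sent, grt_sent in pairs:
        en_words = set(en_sent.lower().split())
        grt_words = set(grt_sent.lower().split())
        for en in en_words:
            en_freq[en] += 1
            cooc[en].update(grt_words)

    dictionary = {}
    for en, counts in cooc.items():
        if not counts:
            continue  # the English word only appeared next to empty Garo sentences
        grt, count = counts.most_common(1)[0]
        # Keep the pairing only if it is frequent and clearly dominant
        if count >= min_count and count / en_freq[en] >= min_share:
            dictionary[en] = grt
    return dictionary


# Toy parallel corpus; the g_* tokens are placeholders, not real Garo
toy_pairs = [
    ("water is cold", "g_water g_cold"),
    ("water is good", "g_water g_good"),
    ("food is good", "g_food g_good"),
    ("food is cold", "g_food g_cold"),
]
toy_dictionary = build_frequency_dictionary(toy_pairs)
print(toy_dictionary)
```

In this toy corpus the function word `is` appears with every placeholder equally often, so its best pairing falls below `min_share` and it is left out. That filtering is what "high-confidence" means in the sketch.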
---
## Training
- **Data size**: ~2,500 English ↔ Garo parallel sentences
  (synthetic English generated with open models, manually translated by native Garo speakers)
- **Method**:
  - FastText skipgram (300-dimensional vectors, char n-grams 3–6, 25 epochs)
  - Linear alignment (Procrustes) between English and Garo vector spaces
  - Frequency-based dictionary extracted from the parallel corpus
- **Artifacts**:
  - `garovec_garo.bin`: Garo FastText embeddings
  - `garovec_english.bin`: English FastText embeddings
  - `garovec_alignment_matrix.npy`: alignment matrix
  - `garovec_model.pkl`: final hybrid model with dictionary + embeddings
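
The Procrustes step in the method list can be sketched with NumPy. This is a minimal illustration, assuming row-vector embeddings (so a source vector maps as `x @ W`) and a set of translation pairs whose vectors are stacked row by row. The function name and the random toy data are hypothetical; the actual training script is not part of this card.

```python
import numpy as np


def procrustes_alignment(X_src, Y_tgt):
    """Return the orthogonal W that minimises ||X_src @ W - Y_tgt||_F.

    X_src, Y_tgt: (n_pairs, dim) arrays; row i of X_src is the source
    vector of a translation pair and row i of Y_tgt is its target vector.
    """
    # The SVD of the cross-covariance gives the closed-form solution
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt


# Toy check: rotate random vectors with a known orthogonal matrix,
# then recover that matrix from the paired vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                       # stand-in for source vectors
R_true, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
Y = X @ R_true                                     # stand-in for target vectors

W_est = procrustes_alignment(X, Y)
print(np.allclose(W_est, R_true))
```

Constraining `W` to be orthogonal keeps distances and angles within the source space intact, which is why this form of alignment is a common choice when only a small seed dictionary is available.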
---
## Usage
```python
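import pickle

import fasttext
import numpy as np

# Load hybrid model data (dictionary + embeddings)
with open("garovec_model.pkl", "rb") as f:
    garovec_data = pickle.load(f)

# Load the monolingual embeddings and the alignment matrix
garo_model = fasttext.load_model("garovec_garo.bin")
english_model = fasttext.load_model("garovec_english.bin")
W = np.load("garovec_alignment_matrix.npy")

# Example: get the nearest Garo words for an English word
vec = english_model.get_word_vector("love")
aligned_vec = vec @ W  # map the English vector into the Garo space
garo_words = garo_model.words[:100]  # first 100 vocabulary entries
candidates = np.array([garo_model.get_word_vector(w) for w in garo_words])

# Rank candidates by cosine similarity to the aligned vector
sims = candidates @ aligned_vec / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(aligned_vec) + 1e-9
)
for i in np.argsort(-sims)[:5]:
    print(garo_words[i], round(float(sims[i]), 3))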
```
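
The internal layout of `garovec_model.pkl` is not documented here, so the following is only a sketch of how a dictionary-first lookup with an embedding fallback might work. It uses plain Python dicts and 2-dimensional toy vectors in place of the real FastText models. `hybrid_translate`, its parameters, and the `g_*` placeholder tokens are all hypothetical.

```python
import numpy as np


def hybrid_translate(word, dictionary, en_vectors, grt_vectors, W, top_k=3):
    """Return (source, candidates): dictionary hit first, embeddings otherwise.

    dictionary: dict mapping English word -> Garo word (assumed layout).
    en_vectors / grt_vectors: dict mapping word -> 1-D vector, standing in
    for the get_word_vector calls of the real models.
    """
    if word in dictionary:
        return "dictionary", [dictionary[word]]
    if word not in en_vectors:
        return "none", []

    query = en_vectors[word] @ W  # align the English vector
    words = list(grt_vectors)
    matrix = np.stack([grt_vectors[w] for w in words])
    # Cosine similarity between the aligned query and every Garo vector
    sims = matrix @ query / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query) + 1e-9
    )
    best = np.argsort(-sims)[:top_k]
    return "embedding", [words[i] for i in best]


# Toy data: identity alignment and placeholder tokens, not real Garo
toy_dict = {"water": "g_water"}
toy_en = {"river": np.array([1.0, 0.0]), "stone": np.array([0.0, 1.0])}
toy_grt = {
    "g_river": np.array([0.9, 0.1]),
    "g_stone": np.array([0.1, 0.9]),
    "g_sky": np.array([-1.0, 0.0]),
}
toy_W = np.eye(2)

from_dict = hybrid_translate("water", toy_dict, toy_en, toy_grt, toy_W)
from_emb = hybrid_translate("river", toy_dict, toy_en, toy_grt, toy_W, top_k=1)
missing = hybrid_translate("cloud", toy_dict, toy_en, toy_grt, toy_W)
print(from_dict, from_emb, missing)
```

Checking the dictionary first means a high-confidence entry is never overridden by a noisier nearest-neighbour match. The `"none"` branch exists only because the toy dicts have a fixed vocabulary.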
---
## Limitations
- Trained on a **small parallel dataset** (~2.5k sentences), so vocabulary coverage is limited.
- Intended for **demonstrations, lexicon building, and low-resource NLP exploration**.
- Future versions will incorporate more data and advanced techniques for broader coverage.
---
## License
- **Model weights & code**: CC-BY-SA 4.0
- **Training data**: private, not released
---
## Citation
If you use **GaroVec v1.0**, please cite:
```
MWire Labs. 2025. GaroVec v1.0 — Hybrid English↔Garo Embeddings. Hugging Face.
```
---
## Acknowledgements
Built by **MWire Labs** with contributions from native Garo speakers.
This release is part of the **NEODAC project** (Northeast India Domain-Adapted Corpus).