---
language:
- en
- grt
tags:
- embeddings
- bilingual
- garo
- low-resource
license: cc-by-sa-4.0
datasets:
- private
model-index:
- name: GaroVec v1.0
  results: []
---

# GaroVec v1.0 — Hybrid English↔Garo Embeddings

[DOI: 10.5281/zenodo.17083589](https://doi.org/10.5281/zenodo.17083589)

## Overview
**GaroVec v1.0** is the *first publicly documented Latin-script Garo embedding model*.
It combines:
- **FastText embeddings** (English + Garo)
- **Cross-lingual alignment** using Procrustes rotation
- **Frequency-based bilingual dictionary** for high-confidence word translations

This hybrid design provides both **semantic embeddings** and **direct dictionary lookups**, making it useful for cross-lingual tasks like translation support, lexicon building, and low-resource NLP research.

---

## Training
- **Data size**: ~2,500 English ↔ Garo parallel sentences (synthetic English generated with open models, manually translated by native Garo speakers)
- **Method**:
  - FastText skipgram (300-dimensional vectors, character n-grams of length 3–6, 25 epochs)
  - Linear alignment (Procrustes) between the English and Garo vector spaces
  - Frequency-based dictionary extracted from the parallel corpus
- **Artifacts**:
  - `garovec_garo.bin`: Garo FastText embeddings
  - `garovec_english.bin`: English FastText embeddings
  - `garovec_alignment_matrix.npy`: alignment matrix
  - `garovec_model.pkl`: final hybrid model with dictionary + embeddings
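
As a rough illustration of the dictionary and alignment steps listed above, the sketch below counts word co-occurrences across sentence pairs and solves the orthogonal Procrustes problem with an SVD. It is not the GaroVec training code: the function names, the whitespace tokenization, and the `min_count` cutoff are assumptions made for this example, and the random vectors only stand in for real FastText vectors. Raw co-occurrence counts also tend to favor frequent function words, so a real pipeline would likely need extra filtering.

```python
from collections import Counter, defaultdict

import numpy as np


def build_frequency_dictionary(pairs, min_count=2):
    """Map each English word to the Garo word it co-occurs with most often.

    `pairs` holds (english_sentence, garo_sentence) strings. Whitespace
    tokenization and the `min_count` cutoff are illustrative assumptions.
    """
    cooc = defaultdict(Counter)
    for en_sent, grt_sent in pairs:
        grt_tokens = set(grt_sent.lower().split())
        for en_word in set(en_sent.lower().split()):
            cooc[en_word].update(grt_tokens)

    dictionary = {}
    for en_word, counts in cooc.items():
        if not counts:
            continue
        grt_word, count = counts.most_common(1)[0]
        if count >= min_count:
            dictionary[en_word] = grt_word
    return dictionary


def procrustes_align(X, Y):
    """Return the orthogonal W that minimizes ||X @ W - Y|| (Frobenius norm).

    Row i of X is an English vector and row i of Y its paired Garo vector.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Toy check with random stand-in vectors (not real FastText output):
# the solver should recover a known orthogonal map R from X and X @ R.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
R, _ = np.linalg.qr(rng.normal(size=(8, 8)))
W = procrustes_align(X, X @ R)
print(np.allclose(W, R))
```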
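
---

## Usage

```python
import pickle

import fasttext
import numpy as np

# Load the hybrid model data (dictionary + embeddings)
with open("garovec_model.pkl", "rb") as f:
    garovec_data = pickle.load(f)

# Load the monolingual embeddings and the alignment matrix
garo_model = fasttext.load_model("garovec_garo.bin")
english_model = fasttext.load_model("garovec_english.bin")
W = np.load("garovec_alignment_matrix.npy")

# Example: get the nearest Garo words for an English word
vec = english_model.get_word_vector("love")
aligned_vec = vec @ W  # project the English vector with the alignment matrix

# Score the first 100 Garo vocabulary words by cosine similarity
garo_words = garo_model.words[:100]
candidates = np.array([garo_model.get_word_vector(w) for w in garo_words])
norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(aligned_vec)
scores = candidates @ aligned_vec / (norms + 1e-9)

# Print the five best matches
for i in np.argsort(-scores)[:5]:
    print(garo_words[i], float(scores[i]))
```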
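
The hybrid design described in the Overview, dictionary lookup backed by embedding search, might be wired together as in the sketch below. The layout of `garovec_model.pkl` is not documented here, so plain Python dicts stand in for the dictionary and the two vector tables; the `translate` function, its parameters, and the placeholder tokens `tok_a` and `tok_b` are hypothetical and are not part of the released artifacts.

```python
import numpy as np


def translate(word, dictionary, src_vectors, tgt_vectors, W, top_k=3):
    """Return candidate target words for a source word.

    The dictionary is consulted first. If the word is missing, its vector
    is projected with W and compared to every target vector by cosine
    similarity. All arguments are stand-ins for the real GaroVec data.
    """
    if word in dictionary:
        return [dictionary[word]]
    if word not in src_vectors:
        return []

    query = src_vectors[word] @ W
    tgt_words = list(tgt_vectors)
    matrix = np.array([tgt_vectors[w] for w in tgt_words])
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    scores = matrix @ query / (norms + 1e-9)
    best = np.argsort(-scores)[:top_k]
    return [tgt_words[i] for i in best]


# Toy data: two-dimensional vectors and placeholder target tokens.
dictionary = {"water": "tok_a"}
src_vectors = {"water": np.array([1.0, 0.0]), "fire": np.array([0.0, 1.0])}
tgt_vectors = {"tok_a": np.array([0.9, 0.1]), "tok_b": np.array([0.1, 0.9])}
W = np.eye(2)

print(translate("water", dictionary, src_vectors, tgt_vectors, W))
print(translate("fire", dictionary, src_vectors, tgt_vectors, W, top_k=1))
```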

---

## Limitations
- Trained on a **small parallel dataset** (~2,500 sentences), so vocabulary coverage is limited.
- Intended for **demonstrations, lexicon building, and low-resource NLP exploration**.
- Future versions will incorporate more data and advanced techniques for broader coverage.

---

## License
- **Model weights & code**: CC-BY-SA 4.0
- **Training data**: private, not released

---

## Citation
If you use **GaroVec v1.0**, please cite:

```
MWire Labs. 2025. GaroVec v1.0 — Hybrid English↔Garo Embeddings. Hugging Face.
```

---

## Acknowledgements
Built by **MWire Labs** with contributions from native Garo speakers.
This release is part of the **NEODAC project** (Northeast India Domain-Adapted Corpus).