GaroVec v1.0 — Hybrid English↔Garo Embeddings
Overview
GaroVec v1.0 is the first publicly documented Latin-script Garo embedding model.
It combines:
- FastText embeddings (English + Garo)
- Cross-lingual alignment using Procrustes rotation
- Frequency-based bilingual dictionary for high-confidence word translations
This hybrid design provides both semantic embeddings and direct dictionary lookups, making it useful for cross-lingual tasks like translation support, lexicon building, and low-resource NLP research.
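The Procrustes rotation listed above can be illustrated with a minimal NumPy sketch. It assumes row-vector embeddings and a set of already paired source/target vectors. The function name `procrustes_rotation` and the random toy data are hypothetical and are not taken from the GaroVec code.

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Return the orthogonal matrix W that minimizes ||X @ W - Y||_F.

    X: (n, d) source-language vectors, one row per word.
    Y: (n, d) target-language vectors for the same word pairs.
    """
    # The SVD of the cross-covariance matrix gives the optimal rotation.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check (illustrative data only): hide a known rotation Q, then recover it.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # a random orthogonal matrix
Y = X @ Q
W_est = procrustes_rotation(X, Y)
print("residual:", np.linalg.norm(X @ W_est - Y))
```

Because W is constrained to be orthogonal, distances and cosine similarities within the source space are preserved after mapping. `scipy.linalg.orthogonal_procrustes` solves the same problem if SciPy is available.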
Training
- Data size: ~2,500 English ↔ Garo parallel sentences (synthetic English generated with open models, manually translated by native Garo speakers)
- Method:
  - FastText skipgram (300-dimensional vectors, character n-grams 3–6, 25 epochs)
  - Linear alignment (Procrustes) between English and Garo vector spaces
  - Frequency-based dictionary extracted from the parallel corpus
- Artifacts:
  - garovec_garo.bin: Garo FastText embeddings
  - garovec_english.bin: English FastText embeddings
  - garovec_alignment_matrix.npy: alignment matrix
  - garovec_model.pkl: final hybrid model with dictionary + embeddings
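The card does not say how the frequency-based dictionary was extracted, so the sketch below shows only one plausible approach: count sentence-level co-occurrences and keep, for each English word, the target word with the highest Dice score. The function name `extract_dictionary`, the `min_score` threshold, and the `g_*` tokens (placeholders, not real Garo words) are all assumptions made for illustration.

```python
from collections import Counter

def extract_dictionary(pairs, min_score=0.6):
    """Map each source word to its best co-occurring target word.

    pairs: iterable of (source_sentence, target_sentence) strings.
    Returns {source_word: (target_word, dice_score)}.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update((s, t) for s in src_words for t in tgt_words)

    best = {}
    for (s, t), n in joint.items():
        # The Dice coefficient discounts words that are simply frequent everywhere.
        dice = 2 * n / (src_count[s] + tgt_count[t])
        if dice >= min_score and dice > best.get(s, (None, 0.0))[1]:
            best[s] = (t, dice)
    return best

# Placeholder corpus: the g_* tokens stand in for Garo words.
toy_pairs = [
    ("water is cold", "g_water g_cold g_is"),
    ("water is here", "g_water g_here g_is"),
    ("fire is hot", "g_fire g_hot g_is"),
    ("fire is here", "g_fire g_here g_is"),
]
toy_dictionary = extract_dictionary(toy_pairs)
print(toy_dictionary["water"])
```

With only about 2,500 sentence pairs, a threshold like `min_score` is one way to keep just the high-confidence entries. Rare words that appear once will often tie across several candidates and may need manual review.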
Usage
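import pickle
import fasttext
import numpy as np

# Load the hybrid model data (dictionary + embeddings).
# Only unpickle files that come from a source you trust.
with open("garovec_model.pkl", "rb") as f:
    garovec_data = pickle.load(f)

# Load the monolingual embeddings and the alignment matrix
garo_model = fasttext.load_model("garovec_garo.bin")
english_model = fasttext.load_model("garovec_english.bin")
W = np.load("garovec_alignment_matrix.npy")

# Example: get the nearest Garo words for an English word
vec = english_model.get_word_vector("love")
aligned_vec = vec @ W  # map the English vector into the Garo space

# Rank the first 100 Garo vocabulary words by cosine similarity
garo_words = garo_model.words[:100]
candidates = np.array([garo_model.get_word_vector(w) for w in garo_words])
scores = candidates @ aligned_vec / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(aligned_vec) + 1e-9
)
for i in np.argsort(-scores)[:5]:
    print(garo_words[i], float(scores[i]))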
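The hybrid design described in the Overview (dictionary lookup first, embeddings as a fallback) could be wired together roughly as follows. This is a sketch under stated assumptions: the layout of `garovec_model.pkl` is not documented here, so the dictionary is assumed to be a plain `dict`, and `hybrid_lookup`, the toy vectors, and the `g_*` placeholder tokens are all hypothetical.

```python
import numpy as np

def hybrid_lookup(word, dictionary, embed, W, target_words, target_vectors, k=3):
    """Return (source, candidates): a dictionary hit if present, else nearest neighbours."""
    if word in dictionary:
        return "dictionary", [dictionary[word]]
    # Fallback: map the source vector into the target space, rank by cosine similarity.
    query = embed(word) @ W
    query = query / (np.linalg.norm(query) + 1e-9)
    norms = np.linalg.norm(target_vectors, axis=1, keepdims=True) + 1e-9
    scores = (target_vectors / norms) @ query
    top = np.argsort(-scores)[:k]
    return "embedding", [target_words[i] for i in top]

# Toy 2-D data (illustrative only; the g_* tokens are placeholders, not Garo words).
toy_dict = {"love": "g_love"}
toy_source = {"love": np.array([1.0, 0.0]), "river": np.array([0.0, 1.0])}
toy_W = np.array([[0.0, 1.0], [-1.0, 0.0]])  # a 90-degree rotation
toy_words = ["g_love", "g_river", "g_other"]
toy_vectors = np.array([[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]])

hit = hybrid_lookup("love", toy_dict, toy_source.__getitem__, toy_W, toy_words, toy_vectors)
fallback = hybrid_lookup("river", toy_dict, toy_source.__getitem__, toy_W, toy_words, toy_vectors, k=1)
print(hit, fallback)
```

With the real artifacts, `embed` would be `english_model.get_word_vector`, and `target_words` and `target_vectors` would be built from `garo_model`, as in the example above.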
Limitations
- Trained on a small parallel dataset (~2.5k sentences).
- Optimized for demonstrations, lexicon building, and low-resource NLP exploration.
- Future versions will incorporate more data and advanced techniques for broader coverage.
License
- Model weights & code: CC-BY-SA 4.0
- Training data: private, not released
Citation
If you use GaroVec v1.0, please cite:
MWire Labs. 2025. GaroVec v1.0 — Hybrid English↔Garo Embeddings. Hugging Face.
Acknowledgements
Built by MWire Labs with contributions from native Garo speakers.
This release is part of the NEODAC project (Northeast India Domain-Adapted Corpus).