GaroVec v1.0 — Hybrid English↔Garo Embeddings

Overview

GaroVec v1.0 is the first publicly documented Latin-script Garo embedding model.
It combines:

  • FastText embeddings (English + Garo)
  • Cross-lingual alignment using Procrustes rotation
  • Frequency-based bilingual dictionary for high-confidence word translations

This hybrid design provides both semantic embeddings and direct dictionary lookups, making it useful for cross-lingual tasks like translation support, lexicon building, and low-resource NLP research.


Training

  • Data size: ~2,500 English ↔ Garo parallel sentences
    (synthetic English generated with open models, manually translated by native Garo speakers)
  • Method:
    • FastText skipgram (300-dimensional vectors, char n-grams 3–6, 25 epochs)
    • Linear alignment (Procrustes) between the English and Garo vector spaces (see the first sketch after this list)
    • Frequency-based dictionary extracted from the parallel corpus (see the second sketch after this list)
  • Artifacts:
    • garovec_garo.bin — Garo FastText embeddings
    • garovec_english.bin — English FastText embeddings
    • garovec_alignment_matrix.npy — alignment matrix
    • garovec_model.pkl — final hybrid model with dictionary + embeddings
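
For reference, the orthogonal Procrustes step can be reproduced along the following lines. This is a minimal sketch, not the released training pipeline: the seed pairs are illustrative, and the matrix orientation (English → Garo, matching the Usage example below) is an assumption.

import numpy as np
import fasttext

# Minimal Procrustes sketch (not the released training code): X holds English
# seed vectors, Y the matching Garo vectors, and we solve
# W = argmin ||XW - Y||_F over orthogonal W via the SVD of X^T Y.
en = fasttext.load_model("garovec_english.bin")
ga = fasttext.load_model("garovec_garo.bin")

# Illustrative seed pairs only; a real run would use the bilingual dictionary.
seed_pairs = [("water", "chi"), ("house", "nok")]
X = np.array([en.get_word_vector(e) for e, _ in seed_pairs])
Y = np.array([ga.get_word_vector(g) for _, g in seed_pairs])

U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt  # 300 x 300 orthogonal matrix mapping English vectors into the Garo space

np.save("alignment_matrix_demo.npy", W)  # released file: garovec_alignment_matrix.npy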

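The frequency-based dictionary can be built by counting word co-occurrences across the parallel sentence pairs and keeping, for each English word, its most frequent Garo counterpart. The sketch below only illustrates the idea; load_parallel_corpus is a hypothetical helper, and the released corpus itself is private.

from collections import Counter, defaultdict

def extract_dictionary(parallel):
    """Frequency-based English -> Garo dictionary from (english_sentence,
    garo_sentence) pairs: for each English token, keep the Garo token it
    co-occurs with most often."""
    cooc = defaultdict(Counter)
    for en_sent, ga_sent in parallel:
        for en_tok in en_sent.lower().split():
            for ga_tok in ga_sent.lower().split():
                cooc[en_tok][ga_tok] += 1
    return {en_tok: counts.most_common(1)[0][0] for en_tok, counts in cooc.items()}

# dictionary = extract_dictionary(load_parallel_corpus())  # hypothetical loader; corpus is private
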
Usage

import pickle
import fasttext
import numpy as np

# Load hybrid model data
with open("garovec_model.pkl", "rb") as f:
    garovec_data = pickle.load(f)

# Load embeddings
garo_model = fasttext.load_model("garovec_garo.bin")
english_model = fasttext.load_model("garovec_english.bin")
W = np.load("garovec_alignment_matrix.npy")

# Example: find the nearest Garo words for an English word
vec = english_model.get_word_vector("love")
aligned_vec = vec @ W  # map the English vector into the Garo space

# Rank the Garo vocabulary by cosine similarity to the aligned vector
garo_words = garo_model.get_words()
garo_vecs = np.array([garo_model.get_word_vector(w) for w in garo_words])
sims = garo_vecs @ aligned_vec / (
    np.linalg.norm(garo_vecs, axis=1) * np.linalg.norm(aligned_vec) + 1e-8
)
nearest = [garo_words[i] for i in np.argsort(-sims)[:5]]
print(nearest)
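
The pickle also bundles the frequency-based bilingual dictionary, which can serve high-confidence lookups before falling back to the embedding search above. The exact layout of garovec_model.pkl is not documented here, so the "dictionary" key below is an assumption; inspect the loaded object to confirm.

# Dictionary-first lookup (sketch). Assumes garovec_data is a dict that stores
# the bilingual dictionary under a "dictionary" key; check garovec_data.keys()
# for the actual layout.
dictionary = garovec_data.get("dictionary", {})
word = "love"
if word in dictionary:
    print("dictionary translation:", dictionary[word])
else:
    print("no dictionary entry; fall back to the embedding search above")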

Limitations

  • Trained on a small parallel corpus (~2,500 sentences), so vocabulary coverage and translation quality are limited.
  • Intended for demonstrations, lexicon building, and low-resource NLP exploration rather than production use.
  • Future versions will incorporate more data and advanced techniques for broader coverage.

License

  • Model weights & code: CC-BY-SA 4.0
  • Training data: private, not released

Citation

If you use GaroVec v1.0, please cite:

MWire Labs. 2025. GaroVec v1.0 — Hybrid English↔Garo Embeddings. Hugging Face.

Acknowledgements

Built by MWire Labs with contributions from native Garo speakers.
This release is part of the NEODAC project (Northeast India Domain-Adapted Corpus).
