	---
	language:
	- en
	- grt
	tags:
	- embeddings
	- bilingual
	- garo
	- low-resource
	license: cc-by-sa-4.0
	datasets:
	- private
	model-index:
	- name: GaroVec v1.0
	  results: []
	---

	# GaroVec v1.0 — Hybrid English↔Garo Embeddings

	[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17083589.svg)](https://doi.org/10.5281/zenodo.17083589)

	## Overview
	GaroVec v1.0 is the first publicly documented Latin-script Garo embedding model.
	It combines:
	- FastText embeddings (English + Garo)
	- Cross-lingual alignment using Procrustes rotation
	- Frequency-based bilingual dictionary for high-confidence word translations

	This hybrid design provides both semantic embeddings and direct dictionary lookups, making it useful for cross-lingual tasks like translation support, lexicon building, and low-resource NLP research.
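
To make the frequency-based dictionary idea concrete, here is a minimal sketch of one way such a dictionary could be extracted from sentence pairs by counting co-occurrences. It is not the GaroVec extraction code: the function name `build_frequency_dictionary`, the thresholds `min_count` and `min_ratio`, and the whitespace tokenization are all assumptions made for illustration, and the toy tokens (`ga`, `gb`, ...) are placeholders rather than real Garo words.

```python
from collections import Counter, defaultdict


def build_frequency_dictionary(pairs, min_count=2, min_ratio=0.5):
    """Map each English word to the target word it co-occurs with most often.

    pairs: iterable of (english_sentence, garo_sentence) strings.
    min_count, min_ratio: hypothetical confidence thresholds.
    """
    cooc = defaultdict(Counter)  # English word -> counts of co-occurring target words
    en_freq = Counter()          # number of sentence pairs containing each English word

    for en_sent, garo_sent in pairs:
        en_words = set(en_sent.lower().split())
        garo_words = set(garo_sent.lower().split())
        for e in en_words:
            en_freq[e] += 1
            cooc[e].update(garo_words)

    dictionary = {}
    for e, counts in cooc.items():
        best_word, n = counts.most_common(1)[0]
        # Keep only entries seen often enough and consistently enough.
        if n >= min_count and n / en_freq[e] >= min_ratio:
            dictionary[e] = best_word
    return dictionary


# Toy data with placeholder target tokens (not real Garo).
toy_pairs = [
    ("water cold", "ga gb"),
    ("water hot", "ga gc"),
    ("fire hot", "gd gc"),
]
toy_dictionary = build_frequency_dictionary(toy_pairs)
```

On the toy data only `water` and `hot` pass the thresholds, because each co-occurs twice with a single target token. A real pipeline would likely also need to discount very frequent target words, which co-occur with almost everything.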

	---

	## Training
	- Data size: ~2,500 English ↔ Garo parallel sentences
	  (synthetic English generated with open models, manually translated by native Garo speakers)
	- Method:
	  - FastText skipgram (300-dimensional vectors, char n-grams 3–6, 25 epochs)
	  - Linear alignment (Procrustes) between English and Garo vector spaces
	  - Frequency-based dictionary extracted from the parallel corpus
	- Artifacts:
	  - `garovec_garo.bin`: Garo FastText embeddings
	  - `garovec_english.bin`: English FastText embeddings
	  - `garovec_alignment_matrix.npy`: alignment matrix
	  - `garovec_model.pkl`: final hybrid model with dictionary + embeddings
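
The Procrustes alignment step can be sketched with NumPy as follows. This is an illustration of the standard orthogonal Procrustes solution, not the released training code: the function name `procrustes_alignment` is hypothetical, and how the anchor word pairs were chosen or whether vectors were normalized first is not documented here. The row-vector convention (`X @ W`) is chosen to match the `vec @ W` call in the Usage section.

```python
import numpy as np


def procrustes_alignment(X_src, Y_tgt):
    """Return the orthogonal matrix W that minimizes ||X_src @ W - Y_tgt||_F.

    X_src, Y_tgt: arrays of shape (n_pairs, dim); row i of X_src is the
    source-language vector whose translation is row i of Y_tgt.
    """
    # Closed-form solution: SVD of the cross-covariance matrix.
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt


# Synthetic check: recover a known rotation from noiseless pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
R_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
Y = X @ R_true
W_est = procrustes_alignment(X, Y)
```

Because `W_est` is orthogonal, it preserves vector norms and cosine similarities within the source space, which is the usual reason for preferring a rotation over an unconstrained linear map.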

	---

	## Usage

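	```python
	import pickle
	import fasttext
	import numpy as np

	# Load hybrid model data (dictionary + embeddings).
	# Only unpickle files obtained from a source you trust.
	with open("garovec_model.pkl", "rb") as f:
	    garovec_data = pickle.load(f)

	# Load embeddings and the alignment matrix
	garo_model = fasttext.load_model("garovec_garo.bin")
	english_model = fasttext.load_model("garovec_english.bin")
	W = np.load("garovec_alignment_matrix.npy")

	# Example: get nearest Garo words for an English word
	vec = english_model.get_word_vector("love")
	aligned_vec = vec @ W  # map the English vector into the Garo space

	# Rank the first 100 Garo vocabulary words by cosine similarity
	garo_words = garo_model.words[:100]
	candidates = np.array([garo_model.get_word_vector(w) for w in garo_words])
	scores = candidates @ aligned_vec / (
	    np.linalg.norm(candidates, axis=1) * np.linalg.norm(aligned_vec) + 1e-9
	)
	for i in np.argsort(-scores)[:5]:
	    print(garo_words[i], float(scores[i]))
	```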
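
The hybrid behaviour described in the Overview (dictionary lookup first, embeddings otherwise) could be wrapped in a helper like the sketch below. The function `hybrid_translate` and its parameters are hypothetical, and the internal layout of `garovec_model.pkl` is not documented here, so `dictionary` is assumed to be a plain `dict` from English words to Garo words. The toy vectors and tokens are placeholders, not real model output.

```python
import numpy as np


def hybrid_translate(word, dictionary, embed_english, garo_words, garo_vectors, W, k=3):
    """Return Garo candidates: a dictionary hit if present, else k nearest neighbours."""
    if word in dictionary:
        return [dictionary[word]]

    # Fallback: align the English vector and rank Garo vectors by cosine similarity.
    query = embed_english(word) @ W
    query = query / (np.linalg.norm(query) + 1e-9)
    normed = garo_vectors / (np.linalg.norm(garo_vectors, axis=1, keepdims=True) + 1e-9)
    scores = normed @ query
    top = np.argsort(-scores)[:k]
    return [garo_words[i] for i in top]


# Toy setup with 2-dimensional vectors and placeholder tokens.
toy_english = {"water": np.array([1.0, 0.0]), "river": np.array([0.2, 1.0])}
toy_garo_words = ["ga", "gb", "gc"]
toy_garo_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
toy_W = np.eye(2)
toy_dictionary = {"water": "ga"}

hit = hybrid_translate("water", toy_dictionary, toy_english.__getitem__,
                       toy_garo_words, toy_garo_vectors, toy_W)
fallback = hybrid_translate("river", toy_dictionary, toy_english.__getitem__,
                            toy_garo_words, toy_garo_vectors, toy_W, k=2)
```

With the real artifacts, `embed_english` would be `english_model.get_word_vector`, and `garo_vectors` would be an array built from `garo_model.get_word_vector` over the chosen Garo vocabulary.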

	---

	## Limitations
	- Trained on a small parallel dataset (~2.5k sentences).
	- Optimized for demonstrations, lexicon building, and low-resource NLP exploration.
	- Future versions will incorporate more data and advanced techniques for broader coverage.

	---

	## License
	- Model weights & code: CC-BY-SA 4.0
	- Training data: private, not released

	---

	## Citation
	If you use GaroVec v1.0, please cite:

	```
	MWire Labs. 2025. GaroVec v1.0 — Hybrid English↔Garo Embeddings. Hugging Face.
	```

	---

	## Acknowledgements
	Built by MWire Labs with contributions from native Garo speakers.
	This release is part of the NEODAC project (Northeast India Domain-Adapted Corpus).