---
language:
- en
- grt
tags:
- embeddings
- bilingual
- garo
- low-resource
license: cc-by-sa-4.0
datasets:
- private
model-index:
- name: GaroVec v1.0
  results: []
---
# GaroVec v1.0 — Hybrid English↔Garo Embeddings
DOI: [10.5281/zenodo.17083589](https://doi.org/10.5281/zenodo.17083589)
## Overview
**GaroVec v1.0** is the *first publicly documented Latin-script Garo embedding model*.
It combines:
- **FastText embeddings** (English + Garo)
- **Cross-lingual alignment** using Procrustes rotation
- **Frequency-based bilingual dictionary** for high-confidence word translations
This hybrid design provides both **semantic embeddings** and **direct dictionary lookups**, making it useful for cross-lingual tasks like translation support, lexicon building, and low-resource NLP research.
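
The hybrid lookup described above can be pictured as "dictionary first, embeddings as a fallback". The snippet below is a minimal sketch under stated assumptions, not the packaged implementation: the function `hybrid_translate`, the toy vectors, and the placeholder tokens `garo_word_a` / `garo_word_b` are all hypothetical, and the real layout of `garovec_model.pkl` may differ.

```python
import numpy as np

def hybrid_translate(word, dictionary, src_vecs, tgt_vecs, W, k=3):
    """Return the dictionary entry if one exists, else the k nearest aligned neighbours."""
    # 1) High-confidence path: direct dictionary lookup.
    if word in dictionary:
        return [dictionary[word]]
    # 2) Fallback path: map the source vector into the target space and rank by cosine.
    if word not in src_vecs:
        return []
    query = src_vecs[word] @ W
    query = query / (np.linalg.norm(query) + 1e-9)
    scored = []
    for tgt_word, vec in tgt_vecs.items():
        sim = float(query @ vec / (np.linalg.norm(vec) + 1e-9))
        scored.append((sim, tgt_word))
    scored.sort(reverse=True)
    return [tgt_word for _, tgt_word in scored[:k]]

# Toy 2-dimensional data with placeholder target tokens (not real Garo words).
toy_dictionary = {"water": "garo_word_a"}
toy_src = {
    "water": np.array([1.0, 0.0]),
    "river": np.array([0.9, 0.1]),
    "fire": np.array([0.0, 1.0]),
}
toy_tgt = {
    "garo_word_a": np.array([1.0, 0.0]),
    "garo_word_b": np.array([0.0, 1.0]),
}
toy_W = np.eye(2)  # identity alignment, for illustration only

print(hybrid_translate("water", toy_dictionary, toy_src, toy_tgt, toy_W))
print(hybrid_translate("river", toy_dictionary, toy_src, toy_tgt, toy_W, k=1))
```

Checking the dictionary first keeps frequent, well-attested words on the high-confidence path, while the embedding fallback still returns candidates for words the dictionary does not cover.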
---
## Training
- **Data size**: ~2,500 English ↔ Garo parallel sentences
  (synthetic English generated with open models, manually translated by native Garo speakers)
- **Method**:
  - FastText skipgram (300-dimensional vectors, char n-grams 3–6, 25 epochs)
  - Linear alignment (Procrustes) between English and Garo vector spaces
  - Frequency-based dictionary extracted from the parallel corpus
- **Artifacts**:
  - `garovec_garo.bin`: Garo FastText embeddings
  - `garovec_english.bin`: English FastText embeddings
  - `garovec_alignment_matrix.npy`: alignment matrix
  - `garovec_model.pkl`: final hybrid model with dictionary + embeddings
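
The alignment and dictionary steps listed above can be sketched roughly as follows. This is an illustration under assumptions, not the released training code: `procrustes_align` uses the standard SVD solution to the orthogonal Procrustes problem, and `frequency_dictionary` uses plain sentence-level co-occurrence counts with a hypothetical `min_count` threshold; the actual extraction rule used for GaroVec is not documented here. The tokens `g1` to `g4` are placeholders, not Garo words.

```python
from collections import Counter, defaultdict

import numpy as np

def procrustes_align(X, Y):
    """Orthogonal W minimising ||X @ W - Y||_F, where rows of X and Y are paired vectors."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def frequency_dictionary(pairs, min_count=2):
    """Map each source word to the target word it co-occurs with most often."""
    cooc = defaultdict(Counter)
    for src_sentence, tgt_sentence in pairs:
        tgt_tokens = set(tgt_sentence.lower().split())
        for src_word in set(src_sentence.lower().split()):
            cooc[src_word].update(tgt_tokens)
    dictionary = {}
    for src_word, counts in cooc.items():
        tgt_word, count = counts.most_common(1)[0]
        if count >= min_count:  # keep only pairs seen together often enough
            dictionary[src_word] = tgt_word
    return dictionary

# Alignment demo: recover a known rotation from paired vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))                       # stand-in "English" vectors
R_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
Y = X @ R_true                                     # stand-in "Garo" vectors
W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))

# Dictionary demo on placeholder sentences.
toy_pairs = [
    ("water is cold", "g1 g2"),
    ("water is hot", "g1 g3"),
    ("fire is hot", "g4 g3"),
]
print(frequency_dictionary(toy_pairs))
```

Constraining `W` to be orthogonal preserves distances and angles within the source space, which tends to matter when only a few thousand sentence pairs are available to anchor the mapping.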
---
## Usage
```python
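import pickle
import fasttext
import numpy as np

# Load hybrid model data (dictionary + embeddings)
with open("garovec_model.pkl", "rb") as f:
    garovec_data = pickle.load(f)

# Load embeddings and the alignment matrix
garo_model = fasttext.load_model("garovec_garo.bin")
english_model = fasttext.load_model("garovec_english.bin")
W = np.load("garovec_alignment_matrix.npy")

# Example: get nearest Garo words for an English word
vec = english_model.get_word_vector("love")
aligned_vec = vec @ W  # map the English vector into the Garo space

# Rank the first 100 Garo vocabulary words by cosine similarity
garo_words = garo_model.words[:100]
candidates = np.array([garo_model.get_word_vector(w) for w in garo_words])
sims = candidates @ aligned_vec / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(aligned_vec) + 1e-9
)
top5 = [garo_words[i] for i in np.argsort(-sims)[:5]]
print(top5)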
```
---
## Limitations
- Trained on a **small parallel dataset** (~2.5k sentences), so vocabulary coverage is limited.
- Intended for **demonstrations, lexicon building, and low-resource NLP exploration**.
- Future versions will incorporate more data and advanced techniques for broader coverage.
---
## License
- **Model weights & code**: CC-BY-SA 4.0
- **Training data**: private, not released
---
## Citation
If you use **GaroVec v1.0**, please cite:
```
MWire Labs. 2025. GaroVec v1.0 — Hybrid English↔Garo Embeddings. Hugging Face.
```
---
## Acknowledgements
Built by **MWire Labs** with contributions from native Garo speakers.
This release is part of the **NEODAC project** (Northeast India Domain-Adapted Corpus).