---
language:
- en
- grt
tags:
- embeddings
- bilingual
- garo
- low-resource
license: cc-by-sa-4.0
datasets:
- private
model-index:
- name: GaroVec v1.0
  results: []
---

# GaroVec v1.0 — Hybrid English↔Garo Embeddings

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17083589.svg)](https://doi.org/10.5281/zenodo.17083589)

## Overview
**GaroVec v1.0** is the *first publicly documented Latin-script Garo embedding model*.  
It combines:
- **FastText embeddings** (English + Garo)
- **Cross-lingual alignment** using Procrustes rotation
- **Frequency-based bilingual dictionary** for high-confidence word translations

This hybrid design provides both **semantic embeddings** and **direct dictionary lookups**, making it useful for cross-lingual tasks like translation support, lexicon building, and low-resource NLP research.
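
The dictionary-plus-embeddings combination can be illustrated with a small lookup routine. This is only a sketch: the function name `hybrid_lookup`, the plain-dict data layout, and the toy words are hypothetical and do not reflect the actual structure of `garovec_model.pkl`. It assumes a dictionary hit is returned directly, and that anything else falls back to cosine similarity in the aligned space.

```python
import numpy as np

def hybrid_lookup(word, dictionary, src_vectors, tgt_vectors, W, top_k=3):
    """Return the dictionary translation if present, else nearest aligned neighbours."""
    # 1. A direct dictionary hit is treated as the high-confidence answer.
    if word in dictionary:
        return [dictionary[word]]
    # 2. Otherwise fall back to the embedding space (if the word has a vector).
    if word not in src_vectors:
        return []
    query = src_vectors[word] @ W  # map the source vector into the target space
    tgt_words = list(tgt_vectors)
    matrix = np.stack([tgt_vectors[w] for w in tgt_words])
    # Cosine similarity between the query and every target vector
    sims = matrix @ query / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query) + 1e-9
    )
    best = np.argsort(-sims)[:top_k]
    return [tgt_words[i] for i in best]

# Toy data with placeholder target words (not real Garo vocabulary)
toy_dictionary = {"water": "garo_word_a"}
toy_src = {"water": np.array([1.0, 0.0]), "river": np.array([0.9, 0.1])}
toy_tgt = {"garo_word_a": np.array([1.0, 0.0]), "garo_word_b": np.array([0.0, 1.0])}
toy_W = np.eye(2)  # identity alignment, for illustration only

dict_hit = hybrid_lookup("water", toy_dictionary, toy_src, toy_tgt, toy_W)
fallback = hybrid_lookup("river", toy_dictionary, toy_src, toy_tgt, toy_W, top_k=1)
```

With real FastText models, the fallback step could also serve out-of-vocabulary words, since FastText builds vectors from character n-grams.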

---

## Training
- **Data size**: ~2,500 English ↔ Garo parallel sentences  
  (synthetic English generated with open models, manually translated by native Garo speakers)  
- **Method**:
  - FastText skipgram (300-dimensional vectors, char n-grams 3–6, 25 epochs)
  - Linear alignment (Procrustes) between English and Garo vector spaces
  - Frequency-based dictionary extracted from the parallel corpus
- **Artifacts**:
  - `garovec_garo.bin` — Garo FastText embeddings
  - `garovec_english.bin` — English FastText embeddings
  - `garovec_alignment_matrix.npy` — alignment matrix
  - `garovec_model.pkl` — final hybrid model with dictionary + embeddings
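
For readers unfamiliar with the alignment step, orthogonal Procrustes can be sketched in a few lines of NumPy. This is a generic illustration, not the training script used for GaroVec. It assumes row-vector embeddings (matching the `vec @ W` convention) and a set of anchor translation pairs; the function name `procrustes_alignment` and the synthetic data are hypothetical.

```python
import numpy as np

def procrustes_alignment(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W that minimises ||src @ W - tgt||_F.

    src, tgt: arrays of shape (n_pairs, dim); row i of each holds the
    vectors of one translation pair.
    """
    # The SVD of the cross-covariance matrix yields the optimal rotation.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Synthetic check: recover a known rotation from noiseless pairs
rng = np.random.default_rng(0)
dim = 8
src = rng.normal(size=(50, dim))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt = src @ true_rotation

W_demo = procrustes_alignment(src, tgt)
```

Because `W_demo` is orthogonal, it preserves distances and angles within the source space. That is why this kind of alignment is often preferred over an unconstrained linear map when only a small number of anchor pairs is available.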

---

## Usage

```python
import pickle
import fasttext
import numpy as np

# Load hybrid model data
with open("garovec_model.pkl", "rb") as f:
    garovec_data = pickle.load(f)

# Load embeddings
garo_model = fasttext.load_model("garovec_garo.bin")
english_model = fasttext.load_model("garovec_english.bin")
W = np.load("garovec_alignment_matrix.npy")

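# Example: rank Garo words by cosine similarity to an English word
vec = english_model.get_word_vector("love")
aligned_vec = vec @ W  # map the English vector into the Garo space

# Only the first 100 vocabulary entries are scored here; widen as needed
garo_words = garo_model.words[:100]
candidates = np.stack([garo_model.get_word_vector(w) for w in garo_words])

# Cosine similarity between the aligned vector and each candidate
sims = candidates @ aligned_vec / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(aligned_vec) + 1e-9
)
for idx in np.argsort(-sims)[:5]:
    print(garo_words[idx], float(sims[idx]))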
```

---

## Limitations
- Trained on a **small parallel dataset** (~2.5k sentences).  
- Optimized for **demonstrations, lexicon building, and low-resource NLP exploration**.  
- Future versions will incorporate more data and advanced techniques for broader coverage. 

---

## License
- **Model weights & code**: CC-BY-SA 4.0  
- **Training data**: private, not released

---

## Citation
If you use **GaroVec v1.0**, please cite:

```
MWire Labs. 2025. GaroVec v1.0 — Hybrid English↔Garo Embeddings. Hugging Face.
```

---

## Acknowledgements
Built by **MWire Labs** with contributions from native Garo speakers.  
This release is part of the **NEODAC project** (Northeast India Domain-Adapted Corpus).