Commit 9f593c3 by Badnyal (verified) · 1 Parent(s): 9692733

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +48 -0
README.md ADDED
@@ -0,0 +1,48 @@
# GaroVec v1.0: Hybrid English↔Garo Embeddings

## Overview
**GaroVec v1.0** is the *first publicly documented Latin-script Garo embedding model*.
It combines:
- **FastText embeddings** (English + Garo)
- **Cross-lingual alignment** using Procrustes rotation
- **Frequency-based bilingual dictionary** for high-confidence word translations

This hybrid design provides both **semantic embeddings** and **direct dictionary lookups**, making it useful for cross-lingual tasks like translation support, lexicon building, and low-resource NLP research.

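
To illustrate how a hybrid of dictionary lookups and embeddings might be queried, here is a minimal sketch. It assumes a dictionary-first strategy with an embedding fallback, which this README does not confirm. The function `hybrid_translate`, the toy vectors, and the placeholder tokens (`g_water`, `g_fire`) are all hypothetical and are not real Garo data or the actual layout of `garovec_model.pkl`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_translate(word, dictionary, aligned_src_vecs, tgt_vecs):
    """Hypothetical lookup: try the dictionary, then fall back to embeddings."""
    # 1. Direct dictionary hit (assumed to be the high-confidence path)
    if word in dictionary:
        return dictionary[word], "dictionary"
    # 2. Fallback: nearest target word by cosine similarity
    query = aligned_src_vecs.get(word)
    if query is None:
        return None, "unknown"
    best = max(tgt_vecs, key=lambda w: cosine(query, tgt_vecs[w]))
    return best, "embedding"

# Toy placeholder data (not real Garo words or real vectors)
dictionary = {"water": "g_water"}
aligned_src_vecs = {"water": [1.0, 0.0], "river": [0.9, 0.1]}
tgt_vecs = {"g_water": [1.0, 0.0], "g_fire": [0.0, 1.0]}

from_dict = hybrid_translate("water", dictionary, aligned_src_vecs, tgt_vecs)
from_vecs = hybrid_translate("river", dictionary, aligned_src_vecs, tgt_vecs)
unknown = hybrid_translate("xyz", dictionary, aligned_src_vecs, tgt_vecs)
```

Returning the source of each answer alongside the word lets a caller treat dictionary hits and embedding guesses with different levels of trust.
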
12
+ ---
13
+
14
+ ## Training
15
+ - **Data size**: ~2,500 English ↔ Garo parallel sentences
16
+ (synthetic English generated with open models, manually translated by native Garo speakers)
17
+ - **Method**:
18
+ - FastText skipgram (300-dimensional vectors, char n-grams 3–6, 25 epochs)
19
+ - Linear alignment (Procrustes) between English and Garo vector spaces
20
+ - Frequency-based dictionary extracted from the parallel corpus
21
+ - **Artifacts**:
22
+ - `garovec_garo.bin` — Garo FastText embeddings
23
+ - `garovec_english.bin` — English FastText embeddings
24
+ - `garovec_alignment_matrix.npy` — alignment matrix
25
+ - `garovec_model.pkl` — final hybrid model with dictionary + embeddings
26
+
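
The Procrustes alignment named in the method list can be sketched with NumPy as follows. This is a generic orthogonal Procrustes solution under the row-vector convention (`vec @ W`), not necessarily the exact procedure used to produce `garovec_alignment_matrix.npy`. The function name `procrustes_alignment` and the random toy matrices are illustrative assumptions; in practice the rows would be vectors of word pairs known to be translations of each other.

```python
import numpy as np

def procrustes_alignment(src, tgt):
    """Return the orthogonal W that minimises ||src @ W - tgt||_F.

    src, tgt: arrays of shape (n_pairs, dim); row i of each is one
    side of a translation pair.
    """
    # SVD of the cross-covariance matrix; the optimum is W = U @ Vt
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Toy check with synthetic data (assumption: not real embeddings)
rng = np.random.default_rng(0)
true_rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
english_vecs = rng.normal(size=(20, 5))
garo_vecs = english_vecs @ true_rotation  # "target" space is a rotated copy

W = procrustes_alignment(english_vecs, garo_vecs)
recovered = np.allclose(W, true_rotation)
```

Because `W` is constrained to be orthogonal, it preserves vector norms and cosine similarities within the source space, which is a common reason to prefer it over an unconstrained linear map when the seed dictionary is small.
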
27
+ ---
28
+
29
+ ## Usage
30
+
31
+ ```python
32
+ import pickle
33
+ import fasttext
34
+ import numpy as np
35
+
36
+ # Load hybrid model data
37
+ with open("garovec_model.pkl", "rb") as f:
38
+ garovec_data = pickle.load(f)
39
+
40
+ # Load embeddings
41
+ garo_model = fasttext.load_model("garovec_garo.bin")
42
+ english_model = fasttext.load_model("garovec_english.bin")
43
+ W = np.load("garovec_alignment_matrix.npy")
44
+
45
+ # Example: get nearest Garo words for English word
46
+ vec = english_model.get_word_vector("love")
47
+ aligned_vec = vec @ W
48
+ candidates = [garo_model.get_word_vector(w) for w in garo_model.words[:100]]