Hailay commited on
Commit
725597f
ยท
verified ยท
1 Parent(s): 586524e

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +115 -0
README.md ADDED
@@ -0,0 +1,115 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language:
3
+ - am
4
+ - ti
5
+ license: mit
6
+ tags:
7
+ - tokenizer
8
+ - byte-pair-encoding
9
+ - bpe
10
+ - geez-script
11
+ - amharic
12
+ - tigrinya
13
+ - low-resource
14
+ - nlp
15
+ - morphology-aware
16
+ - Horn of Africa
17
+ datasets:
18
+ - HornMT
19
+ library_name: transformers
20
+ pipeline_tag: token-classification
21
+ widget:
22
+ - text: "!"
23
+ model-index:
24
+ - name: Geez BPE Tokenizer
25
+ results: []
26
+ ---
27
+
28
+ # Geez Tokenizer (`Hailay/geez-tokenizer`)
29
+
30
+ A **BPE tokenizer** specifically trained for **Geez-script languages**, including **Tigrinya** and **Amharic**. The tokenizer is trained on monolingual corpora derived from the [HornMT](https://github.com/HornMT) project and supports morphologically rich low-resource languages.
31
+
32
+ ## ๐Ÿง  Motivation
33
+
34
+ Byte-Pair Encoding (BPE) tokenizers trained on English or Latin-script languages often fail to tokenize Geez-script languages efficiently. This tokenizer aims to:
35
+
36
+ - Reduce over-segmentation errors
37
+ - Respect morpheme boundaries
38
+ - Improve language understanding for downstream tasks like Machine Translation and QA
39
+
40
+ ## ๐Ÿ“š Training Details
41
+
42
+ - **Tokenizer Type**: BPE
43
+ - **Vocabulary Size**: 32,000
44
+ - **Pre-tokenizer**: Whitespace
45
+ - **Normalizer**: NFD โ†’ Lowercase โ†’ StripAccents
46
+ - **Special Tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
47
+ - **Post-processing**: Template for `[CLS] $A [SEP]` and `[CLS] $A [SEP] $B [SEP]`
48
+
49
+ ## ๐Ÿ“ Files
50
+
51
+ - `vocab.json`: Vocabulary file
52
+ - `merges.txt`: Merge rules for BPE
53
+ - `tokenizer.json`: Full tokenizer config
54
+ - `tokenizer_config.json`: Hugging Face-compatible configuration
55
+ - `special_tokens_map.json`: Maps for special tokens
56
+
57
+ ## ๐Ÿš€ Usage
58
+
59
+ ```python
60
+ from transformers import PreTrainedTokenizerFast
61
+
62
+ tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")
63
+
64
+ text = "แ‹จแŒแ‰ฅแ… แŠ แˆญแŠชแŠฆแˆŽแŒ‚แˆตแ‰ถแ‰ฝ แ‰ แˆณแŠฉแˆซ แŠ”แŠญแˆฎแ–แˆŠแˆต แ‹แˆตแŒฅ แ‹จแ‰ฐแŒˆแŠ˜แ‹แŠ• แ‰ตแˆแ‰แŠ• แˆ˜แ‰ƒแ‰ฅแˆญ แŠ แŒแŠแ‰ฐแ‹‹แˆแข"
65
+ tokens = tokenizer.tokenize(text)
66
+ ids = tokenizer.encode(text)
67
+
68
+ print("Tokens:", tokens)
69
+ print("Token IDs:", ids)
70
+
71
+ ## ๐Ÿ“Š Intended Use
72
+
73
+ This tokenizer is best suited for:
74
+
75
+ Low-resource NLP pipelines
76
+
77
+ Machine Translation
78
+
79
+ Question Answering
80
+
81
+ Named Entity Recognition
82
+
83
+ Morphological analysis
84
+
85
+
86
+
87
+ โœ‹ #**Limitations**
88
+ It is optimized for Geez-script languages and might not generalize to others.
89
+
90
+ Some compound verbs and morphologically fused words may still require linguistic preprocessing.
91
+
92
+ Currently monolingual for Amharic and Tigrinya; does not support multilingual code-switching.
93
+
94
+
95
+ โœ… #**Evaluation**
96
+ The tokenizer was evaluated manually on:
97
+
98
+ Token coverage of Tigrinya/Amharic corpora
99
+
100
+ Morphological preservation
101
+
102
+ Reduction of BPE segmentation errors
103
+
104
+ Quantitative metrics to be published in an accompanying paper.
105
+
106
+ ๐Ÿ“œ #**License**
107
+ This tokenizer is licensed under the MIT License.
108
+ ๐Ÿ“Œ #**Citation**
109
+
110
+ @misc{hailay2025geez,
111
+ title={Geสฝez Script_Tokenizer: A Morpheme-Aware BPE Tokenizer for Geez Script Languages},
112
+ author={Teklehaymanot, Hailay},
113
+ year={2025},
114
+ howpublished={\url{https://huggingface.co/Hailay/geez-tokenizer}},
115
+ }