---
language:
- am
- ti
license: mit
tags:
- tokenizer
- byte-pair-encoding
- bpe
- geez-script
- amharic
- tigrinya
- low-resource
- nlp
- morphology-aware
- horn-of-africa
datasets:
- HornMT
library_name: transformers
pipeline_tag: token-classification
widget:
- text: "!"
model-index:
- name: Geez BPE Tokenizer
  results: []
---

# Geez Tokenizer (`Hailay/geez-tokenizer`)

A **BPE tokenizer** trained specifically for **Geez-script languages**, including **Tigrinya** and **Amharic**. The tokenizer is trained on monolingual corpora derived from the [HornMT](https://github.com/HornMT) project and targets morphologically rich, low-resource languages.

## 🧠 Motivation

Byte-Pair Encoding (BPE) tokenizers trained on English or other Latin-script languages often tokenize Geez-script text inefficiently. This tokenizer aims to:

- Reduce over-segmentation errors
- Respect morpheme boundaries
- Improve language understanding for downstream tasks such as Machine Translation and Question Answering

## 📚 Training Details

- **Tokenizer type**: BPE
- **Vocabulary size**: 32,000
- **Pre-tokenizer**: Whitespace
- **Normalizer**: NFD → Lowercase → StripAccents
- **Special tokens**: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- **Post-processing**: Template for `[CLS] $A [SEP]` and `[CLS] $A [SEP] $B [SEP]`

A configuration sketch that reproduces these settings is included at the end of this card.

## 📁 Files

- `vocab.json`: Vocabulary file
- `merges.txt`: BPE merge rules
- `tokenizer.json`: Full tokenizer configuration
- `tokenizer_config.json`: Hugging Face-compatible configuration
- `special_tokens_map.json`: Special-token mapping

## 🚀 Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")

text = "የግብፅ አርኪኦሎጂስቶች በሳኩራ ኔክሮፖሊስ ውስጥ የተገኘውን ትልቁን መቃብር አግኝተዋል።"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print("Tokens:", tokens)
print("Token IDs:", ids)
```

## 📊 Intended Use

This tokenizer is best suited for:

- Low-resource NLP pipelines
- Machine Translation
- Question Answering
- Named Entity Recognition
- Morphological analysis

## ✋ Limitations

- It is optimized for Geez-script languages and may not generalize to others.
- Some compound verbs and morphologically fused words may still require linguistic preprocessing.
- Currently covers only monolingual Amharic and Tigrinya text; it does not support multilingual code-switching.

## ✅ Evaluation

The tokenizer was evaluated manually on:

- Token coverage of Tigrinya/Amharic corpora
- Morphological preservation
- Reduction of BPE segmentation errors

Quantitative metrics are to be published in an accompanying paper.

## 📜 License

This tokenizer is licensed under the MIT License.

## 📌 Citation

```bibtex
@misc{hailay2025geez,
  title={Geʽez Script Tokenizer: A Morpheme-Aware BPE Tokenizer for Geez Script Languages},
  author={Teklehaymanot, Hailay},
  year={2025},
  howpublished={\url{https://huggingface.co/Hailay/geez-tokenizer}},
}
```
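
## 🛠️ Configuration Sketch

The sketch below shows how the settings listed under Training Details (BPE model, Whitespace pre-tokenizer, NFD → Lowercase → StripAccents normalizer, 32k vocabulary, the five special tokens, and the `[CLS]`/`[SEP]` post-processing template) can be assembled with the Hugging Face `tokenizers` library. It is provided for reference only and is not the exact training script; in particular, `corpus.txt` is a placeholder for the HornMT-derived monolingual corpus.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers, processors

# BPE model with the [UNK] token used by this tokenizer
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Normalization pipeline: NFD -> Lowercase -> StripAccents
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.Lowercase(),
    normalizers.StripAccents(),
])

# Whitespace pre-tokenization
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Trainer with a 32,000-token vocabulary and the card's special tokens
trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# "corpus.txt" is a hypothetical path to the monolingual training corpus
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Post-processing template matching [CLS] $A [SEP] and [CLS] $A [SEP] $B [SEP]
cls_id = tokenizer.token_to_id("[CLS]")
sep_id = tokenizer.token_to_id("[SEP]")
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
)

tokenizer.save("tokenizer.json")
```

The saved `tokenizer.json` can then be loaded through `PreTrainedTokenizerFast(tokenizer_file="tokenizer.json", ...)` or pushed to the Hub and used as shown in the Usage section above.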