# Geez Tokenizer (`Hailay/geez-tokenizer`)
A BPE tokenizer specifically trained for Geez-script languages, including Tigrinya and Amharic. The tokenizer is trained on monolingual corpora derived from the HornMT project and supports morphologically rich low-resource languages.
## Motivation
Byte-Pair Encoding (BPE) tokenizers trained on English or Latin-script languages often fail to tokenize Geez-script languages efficiently. This tokenizer aims to:
- Reduce over-segmentation errors
- Respect morpheme boundaries
- Improve language understanding for downstream tasks like Machine Translation and QA
## Training Details
- Tokenizer Type: BPE
- Vocabulary Size: 32,000
- Pre-tokenizer: Whitespace
- Normalizer: NFD → Lowercase → StripAccents
- Special Tokens: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- Post-processing: templates `[CLS] $A [SEP]` (single sequence) and `[CLS] $A [SEP] $B [SEP]` (sequence pair); see the training sketch below
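For reference, this is a minimal sketch of how a tokenizer with the configuration above could be trained with the Hugging Face `tokenizers` library. It is not the author's exact training script, and `corpus.txt` is a placeholder for the monolingual training data:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, processors, trainers

# BPE model with the [UNK] fallback token.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Normalization pipeline: NFD -> Lowercase -> StripAccents.
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
# `corpus.txt` is a hypothetical placeholder for the training files.
tokenizer.train(["corpus.txt"], trainer)

# Post-processing templates for single sequences and sequence pairs.
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
tokenizer.save("tokenizer.json")
```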
## Files

- `vocab.json`: Vocabulary file
- `merges.txt`: Merge rules for BPE
- `tokenizer.json`: Full tokenizer config
- `tokenizer_config.json`: Hugging Face-compatible configuration
- `special_tokens_map.json`: Map for special tokens
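If you have the files locally rather than pulling from the Hub, `tokenizer.json` can be loaded directly; a sketch assuming the files are in the working directory:

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Load the raw tokenizer from the full config file...
raw = Tokenizer.from_file("tokenizer.json")

# ...or wrap it for use with transformers pipelines and models.
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```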
## Usage

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")

text = "ሰላም ዓለም።"  # sample Ge'ez-script sentence ("Hello world.")
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print("Tokens:", tokens)
print("Token IDs:", ids)
```
## Intended Use
This tokenizer is best suited for:

- Low-resource NLP pipelines
- Machine Translation
- Question Answering
- Named Entity Recognition
- Morphological analysis
## Limitations

- It is optimized for Geez-script languages and might not generalize to others.
- Some compound verbs and morphologically fused words may still require linguistic preprocessing.
- It covers only Amharic and Tigrinya; it does not support multilingual code-switching.
## Evaluation

The tokenizer was evaluated manually on:

- Token coverage of Tigrinya/Amharic corpora
- Morphological preservation
- Reduction of BPE segmentation errors

Quantitative metrics are to be published in an accompanying paper; a sketch of one common segmentation metric follows.
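As an illustration, over-segmentation is often quantified with *fertility*: the average number of subword tokens per whitespace-separated word, where lower values indicate less fragmentation. The sketch below is an assumed example, not the published evaluation code, and `eval_corpus.txt` is a placeholder file name:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")

def fertility(lines):
    """Average number of subword tokens per whitespace-separated word."""
    n_words = n_tokens = 0
    for line in lines:
        n_words += len(line.split())
        n_tokens += len(tokenizer.tokenize(line))
    return n_tokens / max(n_words, 1)

# `eval_corpus.txt` is a hypothetical held-out Tigrinya/Amharic text file.
with open("eval_corpus.txt", encoding="utf-8") as f:
    print("Fertility:", fertility(f))
```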
## License
This tokenizer is licensed under the MIT License.
## Citation

```bibtex
@misc{hailay2025geez,
  title={Geʽez Script Tokenizer: A Morpheme-Aware BPE Tokenizer for Geez Script Languages},
  author={Teklehaymanot, Hailay},
  year={2025},
  howpublished={\url{https://huggingface.co/Hailay/geez-tokenizer}},
}
```