# Geez Tokenizer (Hailay/geez-tokenizer)

A BPE tokenizer trained specifically for Geez-script languages, including Tigrinya and Amharic. It is trained on monolingual corpora derived from the HornMT project and targets morphologically rich, low-resource languages.

## 🧠 Motivation

Byte-Pair Encoding (BPE) tokenizers trained on English or Latin-script languages often fail to tokenize Geez-script languages efficiently. This tokenizer aims to:

  • Reduce over-segmentation errors
  • Respect morpheme boundaries
  • Improve language understanding for downstream tasks like Machine Translation and QA

## 📚 Training Details

  • Tokenizer Type: BPE
  • Vocabulary Size: 32,000
  • Pre-tokenizer: Whitespace
  • Normalizer: NFD → Lowercase → StripAccents
  • Special Tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
  • Post-processing: Template for [CLS] $A [SEP] and [CLS] $A [SEP] $B [SEP] (reproduced in the sketch below)
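
The configuration above can be reproduced with the Hugging Face `tokenizers` library. The sketch below is illustrative, not the exact training script used for this release: the corpus file name is a placeholder, and the templates mirror the list above.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers, processors

# BPE model with the special tokens and normalization pipeline listed above
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus path

# Post-processing templates for single sentences and sentence pairs
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
tokenizer.save("tokenizer.json")
```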

## 📁 Files

  • vocab.json: Vocabulary file
  • merges.txt: Merge rules for BPE
  • tokenizer.json: Full tokenizer config (loadable standalone, as shown below)
  • tokenizer_config.json: Hugging Face-compatible configuration
  • special_tokens_map.json: Maps for special tokens
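
If you only need the core tokenizer, `tokenizer.json` can be loaded directly with the `tokenizers` library, without a `transformers` dependency. A brief sketch, assuming the file has been downloaded locally:

```python
from tokenizers import Tokenizer

# Load the standalone tokenizer.json listed above (local path assumed)
tok = Tokenizer.from_file("tokenizer.json")
print("Vocab size:", tok.get_vocab_size())
```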

## 🚀 Usage

```python
from transformers import PreTrainedTokenizerFast

# Load the pretrained tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")

# Amharic example sentence: "Egyptian archaeologists have found the large
# tomb discovered in the Sakura necropolis."
text = "የግብፅ አርኪኦሎጂስቶች በሳኩራ ኔክሮፖሊስ ውስጥ የተገኘውን ትልቁን መቃብር አግኝተዋል።"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print("Tokens:", tokens)
print("Token IDs:", ids)
```

## 📊 Intended Use

This tokenizer is best suited for:

  • Low-resource NLP pipelines
  • Machine Translation
  • Question Answering
  • Named Entity Recognition
  • Morphological analysis

## ✋ Limitations

  • The tokenizer is optimized for Geez-script languages and may not generalize to others.
  • Some compound verbs and morphologically fused words may still require linguistic preprocessing.
  • Coverage is currently limited to Amharic and Tigrinya; multilingual code-switching is not supported.


## ✅ Evaluation

The tokenizer was evaluated manually on:

  • Token coverage of Tigrinya/Amharic corpora
  • Morphological preservation
  • Reduction of BPE segmentation errors

Quantitative metrics are to be published in an accompanying paper; a simple proxy metric is sketched below.
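
As a reproducible proxy for over-segmentation (not the evaluation protocol used here, which is unpublished), one can measure subword fertility, the average number of tokens per whitespace-separated word. A hypothetical helper:

```python
# Hypothetical helper (not the authors' metric): subword fertility.
# Lower values mean fewer subword pieces per word, i.e. less over-segmentation.
def fertility(tokenizer, sentences):
    n_words = sum(len(s.split()) for s in sentences)
    n_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    return n_tokens / n_words

sample = ["ሰላም ዓለም"]  # replace with a held-out Tigrinya/Amharic corpus
print("Fertility:", fertility(tokenizer, sample))
```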

## 📜 License

This tokenizer is licensed under the MIT License.
## 📌 Citation

```bibtex
@misc{hailay2025geez,
  title={Geʽez Script Tokenizer: A Morpheme-Aware BPE Tokenizer for Geez Script Languages},
  author={Teklehaymanot, Hailay},
  year={2025},
  howpublished={\url{https://huggingface.co/Hailay/geez-tokenizer}},
}
```