# Geez Tokenizer (Hailay/geez-tokenizer)

A BPE tokenizer trained specifically for Geez-script languages, including Tigrinya and Amharic. It is trained on monolingual corpora derived from the HornMT project and targets morphologically rich, low-resource languages.

## 🧠 Motivation

Byte-Pair Encoding (BPE) tokenizers trained on English or Latin-script languages often fail to tokenize Geez-script languages efficiently. This tokenizer aims to:

  • Reduce over-segmentation errors
  • Respect morpheme boundaries
  • Improve language understanding for downstream tasks like Machine Translation and QA

## 📚 Training Details

  • Tokenizer Type: BPE
  • Vocabulary Size: 32,000
  • Pre-tokenizer: Whitespace
  • Normalizer: NFD → Lowercase → StripAccents
  • Special Tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
  • Post-processing: Template for [CLS] $A [SEP] and [CLS] $A [SEP] $B [SEP] (reproduced in the sketch below)
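
The configuration above can be reproduced with the Hugging Face `tokenizers` library. The sketch below is illustrative, not the exact training script used for this release: the corpus file name is a placeholder, and the templates mirror the list above.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers, processors

# BPE model with the special tokens and normalization pipeline listed above
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus path

# Post-processing templates for single sentences and sentence pairs
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
tokenizer.save("tokenizer.json")
```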

## 📁 Files

  • vocab.json: Vocabulary file
  • merges.txt: Merge rules for BPE
  • tokenizer.json: Full tokenizer config (loadable standalone, as shown below)
  • tokenizer_config.json: Hugging Face-compatible configuration
  • special_tokens_map.json: Maps for special tokens
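
If you only need the core tokenizer, `tokenizer.json` can be loaded directly with the `tokenizers` library, without a `transformers` dependency. A brief sketch, assuming the file has been downloaded locally:

```python
from tokenizers import Tokenizer

# Load the standalone tokenizer.json listed above (local path assumed)
tok = Tokenizer.from_file("tokenizer.json")
print("Vocab size:", tok.get_vocab_size())
```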

## 🚀 Usage

```python
from transformers import PreTrainedTokenizerFast

# Load the pretrained tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")

# Amharic example sentence: "Egyptian archaeologists have found the large
# tomb discovered in the Sakura necropolis."
text = "የግብፅ አርኪኦሎጂስቶች በሳኩራ ኔክሮፖሊስ ውስጥ የተገኘውን ትልቁን መቃብር አግኝተዋል።"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print("Tokens:", tokens)
print("Token IDs:", ids)
```

## 📊 Intended Use

This tokenizer is best suited for:

  • Low-resource NLP pipelines
  • Machine Translation
  • Question Answering
  • Named Entity Recognition
  • Morphological analysis

## ✋ Limitations

  • The tokenizer is optimized for Geez-script languages and may not generalize to others.
  • Some compound verbs and morphologically fused words may still require linguistic preprocessing.
  • Coverage is currently limited to Amharic and Tigrinya; multilingual code-switching is not supported.


## ✅ Evaluation

The tokenizer was evaluated manually on:

  • Token coverage of Tigrinya/Amharic corpora
  • Morphological preservation
  • Reduction of BPE segmentation errors

Quantitative metrics are to be published in an accompanying paper; a simple proxy metric is sketched below.
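
As a reproducible proxy for over-segmentation (not the evaluation protocol used here, which is unpublished), one can measure subword fertility, the average number of tokens per whitespace-separated word. A hypothetical helper:

```python
# Hypothetical helper (not the authors' metric): subword fertility.
# Lower values mean fewer subword pieces per word, i.e. less over-segmentation.
def fertility(tokenizer, sentences):
    n_words = sum(len(s.split()) for s in sentences)
    n_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    return n_tokens / n_words

sample = ["ሰላም ዓለም"]  # replace with a held-out Tigrinya/Amharic corpus
print("Fertility:", fertility(tokenizer, sample))
```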

## 📜 License

This tokenizer is licensed under the MIT License.
## 📌 Citation

```bibtex
@misc{hailay2025geez,
  title={Geʽez Script Tokenizer: A Morpheme-Aware BPE Tokenizer for Geez Script Languages},
  author={Teklehaymanot, Hailay},
  year={2025},
  howpublished={\url{https://huggingface.co/Hailay/geez-tokenizer}},
}
```