---
library_name: transformers
language:
  - si
---
# Sinhala GPT Tokenizer
This is a custom tokenizer trained specifically for the Sinhala language using Byte-Pair Encoding (BPE), intended for GPT-style language models and other Sinhala NLP applications. It was trained on a Sinhala dataset to provide high-quality tokenization for downstream tasks such as language modeling, classification, and generation.
## Model Details

### Model Description
This tokenizer is designed for Sinhala language processing. It was trained on a cleaned 250 MB Sinhala corpus using the Hugging Face `tokenizers` library. The training strategy is optimized for use in GPT-style auto-regressive models, with ByteLevel pre-tokenization and decoding for better whitespace handling.
- **Developed by:** Navanjana
- **Model type:** BPE tokenizer (ByteLevel)
- **Language(s):** Sinhala (si)
- **License:** Apache 2.0
- **Trained with:** Hugging Face `tokenizers`
- **Vocab size:** 50,000 tokens
- **Special tokens:** `<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>`
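For reference, a tokenizer with the settings listed above could be trained roughly as follows. This is an illustrative sketch with the `tokenizers` library, not the exact training script; the corpus path `sinhala_corpus.txt` is a hypothetical placeholder.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# ByteLevel BPE, matching the configuration described above.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50_000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
# `sinhala_corpus.txt` is a hypothetical plain-text corpus file.
tokenizer.train(["sinhala_corpus.txt"], trainer)
tokenizer.save("tokenizer.json")
```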
### Model Sources

- **Repository:** [Navanjana/sinhala-gpt-tokenizer](https://huggingface.co/Navanjana/sinhala-gpt-tokenizer)
## Uses

### Direct Use
You can use this tokenizer directly to tokenize Sinhala text, or as the tokenizer when training or fine-tuning a GPT-style model.
### Downstream Use

Use this tokenizer to:

- Fine-tune GPT-style models on Sinhala data (see the sketch after this list)
- Power chatbots and Sinhala NLP tools
- Preprocess Sinhala corpora
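For the first of these, one possible setup is to pair the tokenizer with a freshly initialized model whose embedding size matches the vocabulary. A minimal sketch, assuming a GPT-2-style architecture (an illustrative choice, not a model shipped with this tokenizer):

```python
from transformers import GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Navanjana/sinhala-gpt-tokenizer")

# Size the model's embeddings to the tokenizer's vocabulary and wire up
# the special tokens defined above.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    bos_token_id=tokenizer.convert_tokens_to_ids("<s>"),
    eos_token_id=tokenizer.convert_tokens_to_ids("</s>"),
)
model = GPT2LMHeadModel(config)  # fresh weights, ready for pre-training
```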
### Out-of-Scope Use
- Non-Sinhala languages (as the tokenizer is not trained for them)
- Code or multi-language datasets (unless retrained)
## Bias, Risks, and Limitations
This tokenizer reflects the patterns of its training data. If your corpus contains biased, offensive, or skewed data, those patterns may be reflected in tokenization and downstream generation. Always evaluate on real-world examples.
### Recommendations
- Retrain or fine-tune the tokenizer on domain-specific or cleaner datasets for high-stakes use cases.
- Use filtering/preprocessing before training downstream models (a hypothetical filtering sketch follows this list).
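One way to do such filtering, assuming you want to keep only lines that are predominantly Sinhala script (Unicode block U+0D80–U+0DFF); the file names here are placeholders:

```python
def mostly_sinhala(line: str, threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of non-space chars are Sinhala."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    sinhala = sum(1 for c in chars if "\u0d80" <= c <= "\u0dff")
    return sinhala / len(chars) >= threshold

# Hypothetical input/output paths.
with open("raw_corpus.txt", encoding="utf-8") as src, \
     open("sinhala_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if mostly_sinhala(line):
            dst.write(line)
```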
## How to Get Started with the Tokenizer
```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub.
tokenizer = PreTrainedTokenizerFast.from_pretrained("Navanjana/sinhala-gpt-tokenizer")

# Tokenize a Sinhala sentence ("May you have a good day").
text = "ඔබට සුබ දවසක් වේවා"
tokens = tokenizer.tokenize(text)
print(tokens)
```
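For training or fine-tuning you typically need token IDs rather than token strings; `encode` and `decode` round-trip the text:

```python
# Convert text to token IDs and back.
ids = tokenizer.encode(text)
print(ids)
print(tokenizer.decode(ids))
```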