---
library_name: transformers
language:
  - si
---
# Sinhala GPT Tokenizer
This is a custom tokenizer trained specifically for the Sinhala language using Byte-Pair Encoding (BPE), intended for GPT-style language models and other Sinhala NLP applications. It was trained on a Sinhala dataset to provide high-quality tokenization for downstream tasks such as language modeling, classification, and generation.
## Model Details

### Model Description
This tokenizer is designed for Sinhala language processing. It was trained on a cleaned 250 MB Sinhala corpus using the Hugging Face `tokenizers` library. The training strategy is optimized for use in GPT-style auto-regressive models, with ByteLevel pre-tokenization and decoding for better whitespace handling.
- **Developed by:** Navanjana
- **Model type:** BPE tokenizer (ByteLevel)
- **Language(s):** Sinhala (si)
- **License:** Apache 2.0
- **Trained with:** Hugging Face `tokenizers`
- **Vocab size:** 50,000 tokens
- **Special tokens:** `<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>`
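For reference, a tokenizer with the settings listed above could be trained roughly as follows. This is an illustrative sketch with the `tokenizers` library, not the exact training script; the corpus path `sinhala_corpus.txt` is a hypothetical placeholder.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# ByteLevel BPE, matching the configuration described above.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50_000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
# `sinhala_corpus.txt` is a hypothetical plain-text corpus file.
tokenizer.train(["sinhala_corpus.txt"], trainer)
tokenizer.save("tokenizer.json")
```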
### Model Sources

- **Repository:** [Navanjana/sinhala-gpt-tokenizer](https://huggingface.co/Navanjana/sinhala-gpt-tokenizer)
## Uses

### Direct Use
You can use this tokenizer directly to tokenize Sinhala text, or as the tokenizer when training or fine-tuning a GPT-style model.
### Downstream Use

Use this tokenizer to:

- Fine-tune GPT-style models on Sinhala data (see the sketch after this list)
- Power chatbots and Sinhala NLP tools
- Preprocess Sinhala corpora
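For the first of these, one possible setup is to pair the tokenizer with a freshly initialized model whose embedding size matches the vocabulary. A minimal sketch, assuming a GPT-2-style architecture (an illustrative choice, not a model shipped with this tokenizer):

```python
from transformers import GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Navanjana/sinhala-gpt-tokenizer")

# Size the model's embeddings to the tokenizer's vocabulary and wire up
# the special tokens defined above.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    bos_token_id=tokenizer.convert_tokens_to_ids("<s>"),
    eos_token_id=tokenizer.convert_tokens_to_ids("</s>"),
)
model = GPT2LMHeadModel(config)  # fresh weights, ready for pre-training
```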
### Out-of-Scope Use
- Non-Sinhala languages (as the tokenizer is not trained for them)
- Code or multi-language datasets (unless retrained)
## Bias, Risks, and Limitations
This tokenizer reflects the patterns of its training data. If your corpus contains biased, offensive, or skewed data, those patterns may be reflected in tokenization and downstream generation. Always evaluate on real-world examples.
### Recommendations
- Retrain or fine-tune the tokenizer on domain-specific or cleaner datasets for high-stakes use cases.
- Use filtering/preprocessing before training downstream models (a hypothetical filtering sketch follows this list).
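One way to do such filtering, assuming you want to keep only lines that are predominantly Sinhala script (Unicode block U+0D80–U+0DFF); the file names here are placeholders:

```python
def mostly_sinhala(line: str, threshold: float = 0.5) -> bool:
    """Return True if at least `threshold` of non-space chars are Sinhala."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    sinhala = sum(1 for c in chars if "\u0d80" <= c <= "\u0dff")
    return sinhala / len(chars) >= threshold

# Hypothetical input/output paths.
with open("raw_corpus.txt", encoding="utf-8") as src, \
     open("sinhala_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if mostly_sinhala(line):
            dst.write(line)
```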
## How to Get Started with the Tokenizer
```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub.
tokenizer = PreTrainedTokenizerFast.from_pretrained("Navanjana/sinhala-gpt-tokenizer")

# Tokenize a Sinhala sentence ("May you have a good day").
text = "ඔබට සුබ දවසක් වේවා"
tokens = tokenizer.tokenize(text)
print(tokens)
```
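For training or fine-tuning you typically need token IDs rather than token strings; `encode` and `decode` round-trip the text:

```python
# Convert text to token IDs and back.
ids = tokenizer.encode(text)
print(ids)
print(tokenizer.decode(ids))
```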