---
library_name: transformers
language:
  - si
---

Sinhala GPT Tokenizer

This is a custom tokenizer trained specifically for the Sinhala language using Byte-Pair Encoding (BPE), intended for use in GPT-style language models and other NLP applications. It was trained on a Sinhala dataset to provide high-quality tokenization for downstream tasks such as language modeling, classification, and generation.


Model Details

Model Description

This tokenizer is designed for Sinhala language processing. It was trained on a cleaned 250 MB Sinhala corpus using the Hugging Face tokenizers library. The training strategy is optimized for GPT-style autoregressive models, with ByteLevel pre-tokenization and decoding for better whitespace handling; a minimal training sketch appears after the list below.

  • Developed by: Navanjana
  • Model type: BPE Tokenizer (ByteLevel)
  • Language(s): Sinhala (si)
  • License: Apache 2.0
  • Trained with: Hugging Face Tokenizers
  • Vocab size: 50,000 tokens
  • Special tokens: <s>, <pad>, </s>, <unk>, <mask>
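
As a rough illustration of the setup described above, here is a minimal, hypothetical training sketch using the Hugging Face tokenizers library. The corpus path and any trainer settings beyond the listed vocab size and special tokens are assumptions, not the published training script.

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# BPE model with the card's unknown token
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))

# ByteLevel pre-tokenization and decoding, as described above
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# "sinhala_corpus.txt" is a placeholder for the cleaned 250 MB corpus
tokenizer.train(files=["sinhala_corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")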

Uses

Direct Use

You can use this tokenizer directly to tokenize any Sinhala text, or as the tokenizer for training/fine-tuning a GPT-style model (see "How to Get Started" below).

Downstream Use

Use this tokenizer to:

  • Fine-tune GPT-style models on Sinhala data (see the sketch after this list)
  • Power chatbots and Sinhala NLP tools
  • Preprocess Sinhala corpora
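
A minimal sketch of pairing this tokenizer with a freshly initialized GPT-2-style model for fine-tuning. The GPT-2 architecture and the single-batch forward pass here are illustrative assumptions, not a published recipe.

from transformers import GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast

# Load the published tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("Navanjana/sinhala-gpt-tokenizer")

# Assumed architecture: a GPT-2 config sized to this tokenizer's vocabulary
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,  # 50,000 for this tokenizer
    bos_token_id=tokenizer.convert_tokens_to_ids("<s>"),
    eos_token_id=tokenizer.convert_tokens_to_ids("</s>"),
)
model = GPT2LMHeadModel(config)

# One forward pass with language-modeling labels
batch = tokenizer("ඔබට සුබ දවසක් වේවා", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])
print(outputs.loss)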

Out-of-Scope Use

  • Non-Sinhala languages (as the tokenizer is not trained for them)
  • Code or multi-language datasets (unless retrained)

Bias, Risks, and Limitations

This tokenizer reflects the patterns of its training data. If your corpus contains biased, offensive, or skewed data, those patterns may be reflected in tokenization and downstream generation. Always evaluate on real-world examples.

Recommendations

  • Retrain or fine-tune the tokenizer on domain-specific or cleaner datasets for high-stakes use cases.
  • Use filtering/preprocessing before training downstream models.

How to Get Started with the Tokenizer

from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("Navanjana/sinhala-gpt-tokenizer")

text = "ඔබට සුබ දවසක් වේවා"

# Split the text into subword tokens
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to token IDs and decode back to the original string
ids = tokenizer.encode(text)
print(tokenizer.decode(ids))