Romanized Sinhala Tokenizer

This tokenizer is trained specifically for Romanized Sinhala text (Sinhala written in the Latin alphabet).

Details

  • Based on mBART's tokenization approach (BPE)
  • Trained on the Swabhasha Romanized Sinhala Dataset
  • Includes a custom language code, "si_rom", for Romanized Sinhala
  • Compatible with sequence-to-sequence models
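
For readers unfamiliar with BPE, the following is a minimal, self-contained sketch of how byte-pair-encoding merges are learned and applied. It is a toy illustration only, not the procedure used to train this tokenizer: the function names (learn_bpe_merges, merge_pair, apply_merges) and the sample words are hypothetical and chosen for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy character-level BPE)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged.
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def apply_merges(word, merges):
    """Segment `word` by applying the learned merge rules in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Illustrative Romanized Sinhala words (sample data for this sketch only).
sample_words = ["mama", "mata", "gedara", "yanawa", "enawa"]
sample_merges = learn_bpe_merges(sample_words, num_merges=10)
print(apply_merges("yanawa", sample_merges))
```

Frequent character sequences end up as single subword units, while rare words fall back to smaller pieces. A production tokenizer adds special tokens, normalization, and a much larger training corpus on top of this idea.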

Usage

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("deshanksuman/romanized-sinhala-tokenizer")

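# Set the source language to the custom Romanized Sinhala code
tokenizer.src_lang = "si_rom"

# Encode text; return_tensors="pt" returns PyTorch tensors and requires PyTorch to be installed
encoded = tokenizer("Romanized Sinhala text goes here", return_tensors="pt")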
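
Before relying on the "si_rom" code, it may be worth confirming that the loaded tokenizer actually has it in its vocabulary. The helper below is a small sketch for that check; has_language_code is a hypothetical name introduced here, and it relies only on the convert_tokens_to_ids method and unk_token_id attribute that Hugging Face tokenizers expose.

```python
def has_language_code(tokenizer, code="si_rom"):
    """Return True if `code` maps to its own vocabulary id rather than the unknown token.

    `tokenizer` is assumed to be a Hugging Face tokenizer (or any object with
    `convert_tokens_to_ids` and `unk_token_id`).
    """
    token_id = tokenizer.convert_tokens_to_ids(code)
    # Unknown tokens resolve to unk_token_id, which is None when no unknown token is set.
    return token_id is not None and token_id != tokenizer.unk_token_id
```

With the tokenizer loaded as above, has_language_code(tokenizer) should return True if the language code was registered as described in Details.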