Romanized Sinhala Tokenizer
This tokenizer is trained specifically for Romanized Sinhala text (Sinhala written in the Latin alphabet).
Details
- Based on mBART's tokenization approach (BPE)
- Trained on the Swabhasha Romanized Sinhala Dataset
- Includes custom language code "si_rom" for Romanized Sinhala
- Compatible with sequence-to-sequence models
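To give a feel for what BPE does, here is a minimal toy sketch in plain Python. It is not the code used to train this tokenizer, and it leaves out everything a real mBART-style tokenizer adds (whitespace markers, special tokens, language codes). The function names and the sample words are illustrative assumptions only.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Start with each word split into single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode_word(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Illustrative sample words only, not taken from the training dataset.
sample_words = ["mama", "mama", "gedara", "yanawa"]
merges = learn_bpe_merges(sample_words, num_merges=2)
print(merges)
print(encode_word("mama", merges))
print(encode_word("amma", merges))
```

Frequent character sequences end up as single units, while rarer words fall back to smaller pieces. This is why a tokenizer trained on Romanized Sinhala is expected to split such text into fewer, more meaningful tokens than one trained on other languages.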
Usage
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("deshanksuman/romanized-sinhala-tokenizer")

# Set the source language code used for encoding
tokenizer.src_lang = "si_rom"

# Encode text and return PyTorch tensors (requires PyTorch to be installed)
encoded = tokenizer("Romanized Sinhala text goes here", return_tensors="pt")