# Nepali-English Tokenizer
This repository contains a custom tokenizer built to handle both Nepali (Devanagari) and English languages, facilitating seamless tokenization for NLP tasks that involve code-switching or translation between these two languages.
## Features
- Multilingual Support: Designed for both Nepali and English, handling a complex script like Devanagari alongside the Roman alphabet.
- Bidirectional Encoding: Encodes and decodes text in both languages for tasks such as translation, language modeling, and sequence tagging.
- Custom Vocabulary: The vocabulary is tailored to common words and subword tokens in both Nepali and English, making it well suited for tasks that involve both languages.
## Installation
You can install and use the tokenizer with Hugging Face's transformers library:

```bash
pip install transformers
```
## Usage

To use the tokenizer, load it directly from the Hugging Face Hub:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("chetanpun/devanagari-english-bpe-tokenizer")
```
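To sanity-check the load, you can inspect a couple of standard transformers attributes (the actual values depend on how this tokenizer was trained):

```python
# Standard PreTrainedTokenizerFast attributes; values depend on the trained vocab.
print("Vocab size:", tokenizer.vocab_size)
print("Special tokens:", tokenizer.all_special_tokens)
```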
To tokenize text in Nepali, English, or a mix of the two:

```python
nepali_text = "मेरो नाम चेतन हो।"
english_text = "My name is Chetan."
mix_text = "Hello मेरो नाम चेतन हो।"

nepali_tokens = tokenizer(nepali_text)
english_tokens = tokenizer(english_text)
mix_tokens = tokenizer(mix_text)

print("Nepali Tokens:", nepali_tokens)
print("English Tokens:", english_tokens)
print("Mixed Tokens:", mix_tokens)
```
You can also encode text into token IDs and decode them back (the `return_tensors="pt"` option returns PyTorch tensors, so it requires PyTorch to be installed):

```python
input_ids = tokenizer.encode(nepali_text, return_tensors="pt")
decoded_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
print("Decoded Text:", decoded_text)
```