# Nepali-English Tokenizer
This repository contains a custom tokenizer built to handle both Nepali (Devanagari) and English languages, facilitating seamless tokenization for NLP tasks that involve code-switching or translation between these two languages.
## Features
- Multilingual Support: Designed for both Nepali and English, handling a complex script like Devanagari alongside the Roman alphabet.
- Bidirectional Encoding: Encodes and decodes text in both languages for tasks such as translation, language modeling, and sequence tagging.
- Custom Vocabulary: The vocabulary is tailored to common words and subword tokens in both Nepali and English, making it well suited for tasks that involve both languages.
## Installation
You can install and use the tokenizer with Hugging Face's transformers library:

```bash
pip install transformers
```
## Usage

To use the tokenizer, load it directly from the Hugging Face Hub:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("chetanpun/devanagari-english-bpe-tokenizer")
```
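To sanity-check the load, you can inspect a couple of standard transformers attributes (the actual values depend on how this tokenizer was trained):

```python
# Standard PreTrainedTokenizerFast attributes; values depend on the trained vocab.
print("Vocab size:", tokenizer.vocab_size)
print("Special tokens:", tokenizer.all_special_tokens)
```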
To tokenize text in Nepali, English, or a mix of the two:

```python
nepali_text = "मेरो नाम चेतन हो।"
english_text = "My name is Chetan."
mix_text = "Hello मेरो नाम चेतन हो।"

nepali_tokens = tokenizer(nepali_text)
english_tokens = tokenizer(english_text)
mix_tokens = tokenizer(mix_text)

print("Nepali Tokens:", nepali_tokens)
print("English Tokens:", english_tokens)
print("Mixed Tokens:", mix_tokens)
```
You can also encode text into token IDs and decode them back (the `return_tensors="pt"` option returns PyTorch tensors, so it requires PyTorch to be installed):

```python
input_ids = tokenizer.encode(nepali_text, return_tensors="pt")
decoded_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
print("Decoded Text:", decoded_text)
```