---
language: te
tags:
- telugu
- sentencepiece
- tokenizer
- bpe
- pretraining
license: apache-2.0
datasets:
- ai4bharat/sangraha
library_name: transformers
---

# 🔔 Telugu BPE Tokenizer (23k vocab) - Vipplav

A Byte-Pair Encoding (BPE) tokenizer trained on over **3.4 lakh (340,000) cleaned Telugu text keys** from the [AI4Bharat Sangraha dataset](https://huggingface.co/datasets/ai4bharat/sangraha) and other open sources. This tokenizer is intended for **pretraining or fine-tuning Telugu language models**.

---

## 📌 Highlights

- **Tokenizer Type**: SentencePiece BPE
- **Vocabulary Size**: 23,000
- **Character Coverage**: 100% Telugu script
- **Library**: 🤗 `transformers` + `sentencepiece`
- **Special Tokens**:
  - ``: Unknown token
  - ``: Padding
  - ``: Start of sequence
  - ``: End of sequence
  - `\n`, `₹`, `•`, `-`: User-defined symbols preserved during training

---

## ✨ Example Usage

```python
from transformers import T5Tokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = T5Tokenizer.from_pretrained("Vipplav/telugu-bpe-23k")

# Sample Telugu input
text = "పరిశీలన తేదీ: 15-06-2025"

# Split the input into subword tokens
tokens = tokenizer.tokenize(text)

# Convert the tokens to IDs and decode them back to text
decoded = tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens), skip_special_tokens=True)

# Display results
print(f"\n📄 Input   : {text}")
print(f"🔤 Tokens  : {tokens}")
print(f"📝 Decoded : {decoded}")
```

## 📜 Citation

If you use this tokenizer, please cite:

**APA:**

> Vipplav AI. (2025). *Telugu BPE Tokenizer (23k vocab)*. Hugging Face. https://huggingface.co/Vipplav/telugu-bpe-23k
>
> AI4Bharat. (2023). *Sangraha: A Large-Scale Multidomain Corpus for Indian Languages*. Hugging Face Datasets.
https://huggingface.co/datasets/ai4bharat/sangraha

**BibTeX:**

```bibtex
@misc{vipplav_telugu_tokenizer,
  author = {Vipplav AI},
  title  = {Telugu BPE Tokenizer (23k vocab)},
  year   = {2025},
  url    = {https://huggingface.co/Vipplav/telugu-bpe-23k}
}

@dataset{sangraha2023,
  author = {AI4Bharat},
  title  = {Sangraha Dataset},
  year   = {2023},
  url    = {https://huggingface.co/datasets/ai4bharat/sangraha}
}
```