---
title: Turkish Tokenizer
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: cc-by-nc-nd-4.0
short_description: Turkish Morphological Tokenizer
---

Turkish Tokenizer

A sophisticated Turkish text tokenizer with morphological analysis, built with Gradio for easy visualization and interaction.

Features

  • Morphological Analysis: Breaks down Turkish words into roots, suffixes, and BPE tokens
  • Visual Tokenization: Color-coded token display with interactive highlighting
  • Statistics Dashboard: Detailed analytics including compression ratios and token distribution
  • Real-time Processing: Instant tokenization with live statistics
  • Example Texts: Pre-loaded Turkish examples for testing
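The statistics dashboard mentioned above includes compression ratios. As a minimal sketch (the exact formula used by app.py is not shown here; characters-per-token is an assumed definition), the statistic can be computed like this:

```python
# Illustrative only: one common way to compute a compression ratio statistic.
# The characters-per-token formula is an assumption, not taken from app.py.
def compression_ratio(text: str, tokens: list) -> float:
    """Ratio of input characters to produced tokens; higher means denser tokens."""
    return len(text) / max(len(tokens), 1)

tokens = ["Merhaba", " ", "Dünya", "!"]
print(compression_ratio("Merhaba Dünya!", tokens))  # 14 chars / 4 tokens = 3.5
```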

How to Use

  1. Enter Turkish text in the input field
  2. Click "🚀 Tokenize" to process the text
  3. View the color-coded tokens in the visualization
  4. Check the statistics for detailed analysis
  5. See the encoded token IDs and decoded text

Token Types

  • 🔴 Roots (ROOT): Base word forms
  • 🔵 Suffixes (SUFFIX): Turkish grammatical suffixes
  • 🟡 BPE: Byte Pair Encoding tokens for subword units
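The three token types above can be illustrated with a toy segmenter: look up the longest known root, peel known suffixes, and fall back to a BPE-style subword token for anything unrecognized. The dictionaries and greedy logic below are simplified assumptions for illustration, not the app's real vocabulary or algorithm:

```python
# Toy sketch of root + suffix segmentation with a BPE-style fallback.
# ROOTS and SUFFIXES are tiny illustrative stand-ins for the JSON vocab files.
ROOTS = {"gel", "git", "kitap"}
SUFFIXES = {"iyor", "um", "lar", "i"}

def segment(word: str):
    """Greedily match the longest known root, then known suffixes; else emit BPE."""
    for end in range(len(word), 0, -1):
        if word[:end] in ROOTS:
            tokens = [("ROOT", word[:end])]
            rest = word[end:]
            while rest:
                for send in range(len(rest), 0, -1):
                    if rest[:send] in SUFFIXES:
                        tokens.append(("SUFFIX", rest[:send]))
                        rest = rest[send:]
                        break
                else:
                    tokens.append(("BPE", rest))
                    rest = ""
            return tokens
    return [("BPE", word)]

print(segment("geliyorum"))  # [('ROOT', 'gel'), ('SUFFIX', 'iyor'), ('SUFFIX', 'um')]
```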

Examples

Try these example texts:

  • "Merhaba Dünya! Bu bir gelişmiş Türkçe tokenizer testidir."
  • "İstanbul'da yaşıyorum ve Türkçe dilini öğreniyorum."
  • "Kitap okumak çok güzeldir ve bilgi verir."
  • "Türkiye Cumhuriyeti'nin başkenti Ankara'dır."
  • "Yapay zeka ve makine öğrenmesi teknolojileri gelişiyor."

Technical Details

This tokenizer uses:

  • Custom morphological analysis for Turkish
  • JSON-based vocabulary files
  • Gradio for the web interface
  • A hybrid root/suffix lookup with BPE fallback for out-of-vocabulary subwords
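A JSON vocabulary file of the kind listed above typically maps tokens to integer IDs. The file name and layout below are assumptions for illustration (an in-memory stand-in is used instead of a real file):

```python
# Hypothetical sketch: loading a JSON vocabulary into id<->token maps.
# The layout {token: id} and the file name "vocab.json" are assumptions.
import io
import json

vocab_file = io.StringIO('{"gel": 0, "iyor": 1, "um": 2}')  # stand-in for open("vocab.json")
token_to_id = json.load(vocab_file)
id_to_token = {i: t for t, i in token_to_id.items()}
print(id_to_token[1])  # iyor
```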

Research Paper

This implementation is based on the research paper:

"Tokens with Meaning: A Hybrid Tokenization Approach for NLP"

📄 arXiv:2508.14292

Authors: M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik

Abstract: A hybrid tokenization framework combining rule-based morphological analysis with statistical subword segmentation improves tokenization for morphologically rich languages like Turkish. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
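The phonological normalization mentioned in the abstract can be sketched with Turkish consonant alternation: a softened final consonant in a surface form (e.g. "kitabı") is mapped back to the hard dictionary root ("kitap") before lookup. The mapping table below is a simplified assumption, not the paper's full rule set:

```python
# Toy illustration of phonological normalization via consonant alternation.
# SOFTENING covers only the common b/c/d/ğ alternations; real rules are richer.
SOFTENING = {"b": "p", "c": "ç", "d": "t", "ğ": "k"}

def candidate_roots(surface: str):
    """Yield the surface form and, if it ends in a softened consonant, the hard form."""
    yield surface
    if surface and surface[-1] in SOFTENING:
        yield surface[:-1] + SOFTENING[surface[-1]]

print(list(candidate_roots("kitab")))  # ['kitab', 'kitap']
```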

Please cite this paper if you use this tokenizer in your research:

@article{bayram2025tokens,
  title={Tokens with Meaning: A Hybrid Tokenization Approach for NLP},
  author={Bayram, M. Ali and Fincan, Ali Arda and Gümüş, Ahmet Semih and Karakaş, Sercan and Diri, Banu and Yıldırım, Savaş and Çelik, Demircan},
  journal={arXiv preprint arXiv:2508.14292},
  year={2025},
  url={https://arxiv.org/abs/2508.14292}
}

Files

  • app.py: Main Gradio application
  • requirements.txt: Python dependencies

Local Development

To run locally:

pip install -r requirements.txt
python app.py

The app will be available at http://localhost:7860

Dependencies

  • gradio: Web interface framework
  • turkish-tokenizer: Core tokenization library

License

This project is available under the CC BY-NC-ND 4.0 License (see the license field in the metadata above).