title: Turkish Tokenizer
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: cc-by-nc-nd-4.0
short_description: Turkish Morphological Tokenizer
Turkish Tokenizer
A Turkish text tokenizer that combines morphological analysis with BPE subword tokens, with a Gradio interface for visualization and interaction.
Features
- Morphological Analysis: Breaks down Turkish words into roots, suffixes, and BPE tokens
- Visual Tokenization: Color-coded token display with interactive highlighting
- Statistics Dashboard: Detailed analytics including compression ratios and token distribution
- Real-time Processing: Instant tokenization with live statistics
- Example Texts: Pre-loaded Turkish examples for testing
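The statistics listed above can be approximated along the following lines. This is a minimal sketch, not the app's own code: the `token_stats` helper and the `(text, type)` token pairs are hypothetical sample data, and treating the compression ratio as input characters per token is an assumption, since this README does not define it.

```python
from collections import Counter

def token_stats(text, tokens):
    """Summarize a tokenization given (token_text, token_type) pairs."""
    counts = Counter(token_type for _, token_type in tokens)
    total = len(tokens)
    return {
        "token_count": total,
        # Assumed definition: input characters per token.
        "chars_per_token": len(text) / total if total else 0.0,
        # Share of each token type among all tokens.
        "distribution": {t: n / total for t, n in counts.items()},
    }

# Hand-written sample segmentation (hypothetical, not tokenizer output).
sample_text = "kitaplar"
sample_tokens = [("kitap", "ROOT"), ("lar", "SUFFIX")]
stats = token_stats(sample_text, sample_tokens)
print(stats)
```

A higher characters-per-token value means fewer tokens are needed for the same text.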
How to Use
- Enter Turkish text in the input field
- Click "🚀 Tokenize" to process the text
- View the color-coded tokens in the visualization
- Check the statistics for detailed analysis
- See the encoded token IDs and decoded text
Token Types
- 🔴 Roots (ROOT): Base word forms
- 🔵 Suffixes (SUFFIX): Turkish grammatical suffixes
- 🟡 BPE: Byte Pair Encoding tokens for subword units
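One way to picture the color coding is a mapping from token type to a background color, applied to each token as an HTML span. The sketch below is illustrative only: `render_tokens`, the color values, and the sample tokens are assumptions, not taken from app.py.

```python
import html

# Colors mirror the legend above; the exact shades are assumptions.
TYPE_COLORS = {"ROOT": "#f8d7da", "SUFFIX": "#cfe2ff", "BPE": "#fff3cd"}

def render_tokens(tokens):
    """Return an HTML string with one colored <span> per (text, type) token."""
    spans = []
    for text, token_type in tokens:
        color = TYPE_COLORS.get(token_type, "#e9ecef")  # gray for unknown types
        spans.append(
            f'<span title="{html.escape(token_type)}" '
            f'style="background:{color}">{html.escape(text)}</span>'
        )
    return "".join(spans)

# Hand-written sample tokens (hypothetical); "<x>" shows that text is escaped.
markup = render_tokens([("ev", "ROOT"), ("ler", "SUFFIX"), ("<x>", "BPE")])
print(markup)
```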
Examples
Try these example texts:
- "Merhaba Dünya! Bu bir gelişmiş Türkçe tokenizer testidir."
- "İstanbul'da yaşıyorum ve Türkçe dilini öğreniyorum."
- "KitapOkumak çok güzeldir ve bilgi verir."
- "Türkiye Cumhuriyeti'nin başkenti Ankara'dır."
- "Yapay zeka ve makine öğrenmesi teknolojileri gelişiyor."
Technical Details
This tokenizer uses:
- Custom morphological analysis for Turkish
- JSON-based vocabulary files
- Gradio for the web interface
- A hybrid algorithm that combines rule-based morphological analysis with BPE subword segmentation
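To illustrate the general idea of hybrid segmentation, here is a toy sketch under stated assumptions. It is not the library's algorithm: the `ROOTS` and `SUFFIXES` sets are tiny hypothetical stand-ins for the JSON vocabulary files, phonological normalization is omitted, and single characters stand in for a trained BPE model in the fallback step.

```python
# Toy dictionaries (hypothetical); the real vocabularies live in JSON files.
ROOTS = {"kitap", "ev", "göz"}
SUFFIXES = {"lar", "ler", "da", "de", "im", "ım"}

def longest_prefix(word, vocab):
    """Return the longest entry of vocab that is a prefix of word, or None."""
    for end in range(len(word), 0, -1):
        if word[:end] in vocab:
            return word[:end]
    return None

def segment(word):
    """Split one lowercase word into (piece, type) pairs."""
    tokens = []
    root = longest_prefix(word, ROOTS)
    if root:
        tokens.append((root, "ROOT"))
        word = word[len(root):]
        # Peel known suffixes off the remainder, longest match first.
        while word:
            suffix = longest_prefix(word, SUFFIXES)
            if suffix is None:
                break
            tokens.append((suffix, "SUFFIX"))
            word = word[len(suffix):]
    # Fallback: single characters stand in for a trained BPE model here.
    tokens.extend((ch, "BPE") for ch in word)
    return tokens

print(segment("kitaplarım"))
print(segment("xyz"))
```

Words with a known root are split into morphemes, while anything the dictionaries do not cover falls through to the subword fallback, so no input is left untokenized.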
Research Paper
This implementation is based on the research paper:
"Tokens with Meaning: A Hybrid Tokenization Approach for NLP"
Authors: M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik
Abstract: A hybrid tokenization framework combining rule-based morphological analysis with statistical subword segmentation improves tokenization for morphologically rich languages like Turkish. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
Please cite this paper if you use this tokenizer in your research:
@article{bayram2024tokens,
title={Tokens with Meaning: A Hybrid Tokenization Approach for NLP},
author={Bayram, M. Ali and Fincan, Ali Arda and Gümüş, Ahmet Semih and Karakaş, Sercan and Diri, Banu and Yıldırım, Savaş and Çelik, Demircan},
journal={arXiv preprint arXiv:2508.14292},
year={2025},
url={https://arxiv.org/abs/2508.14292}
}
Files
- app.py: Main Gradio application
- requirements.txt: Python dependencies
Local Development
To run locally:
pip install -r requirements.txt
python app.py
The app will be available at http://localhost:7860
Dependencies
- gradio: Web interface framework
- turkish-tokenizer: Core tokenization library
License
This project is open source and available under the MIT License.