---
title: Turkish Tokenizer
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: cc-by-nc-nd-4.0
short_description: Turkish Morphological Tokenizer
---

Turkish Tokenizer

A sophisticated Turkish text tokenizer with morphological analysis, built with Gradio for easy visualization and interaction.

Features

  • Morphological Analysis: Breaks down Turkish words into roots, suffixes, and BPE tokens
  • Visual Tokenization: Color-coded token display with interactive highlighting
  • Statistics Dashboard: Detailed analytics including compression ratios and token distribution
  • Real-time Processing: Instant tokenization with live statistics
  • Example Texts: Pre-loaded Turkish examples for testing
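The statistics dashboard mentioned above includes compression ratios. As a minimal sketch (the exact formula used by app.py is not shown here; characters-per-token is an assumed definition), the statistic can be computed like this:

```python
# Illustrative only: one common way to compute a compression ratio statistic.
# The characters-per-token formula is an assumption, not taken from app.py.
def compression_ratio(text: str, tokens: list) -> float:
    """Ratio of input characters to produced tokens; higher means denser tokens."""
    return len(text) / max(len(tokens), 1)

tokens = ["Merhaba", " ", "Dünya", "!"]
print(compression_ratio("Merhaba Dünya!", tokens))  # 14 chars / 4 tokens = 3.5
```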

How to Use

  1. Enter Turkish text in the input field
  2. Click "🚀 Tokenize" to process the text
  3. View the color-coded tokens in the visualization
  4. Check the statistics for detailed analysis
  5. See the encoded token IDs and decoded text

Token Types

  • 🔴 Roots (ROOT): Base word forms
  • 🔵 Suffixes (SUFFIX): Turkish grammatical suffixes
  • 🟡 BPE: Byte Pair Encoding tokens for subword units
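The three token types above can be illustrated with a toy segmenter: look up the longest known root, peel known suffixes, and fall back to a BPE-style subword token for anything unrecognized. The dictionaries and greedy logic below are simplified assumptions for illustration, not the app's real vocabulary or algorithm:

```python
# Toy sketch of root + suffix segmentation with a BPE-style fallback.
# ROOTS and SUFFIXES are tiny illustrative stand-ins for the JSON vocab files.
ROOTS = {"gel", "git", "kitap"}
SUFFIXES = {"iyor", "um", "lar", "i"}

def segment(word: str):
    """Greedily match the longest known root, then known suffixes; else emit BPE."""
    for end in range(len(word), 0, -1):
        if word[:end] in ROOTS:
            tokens = [("ROOT", word[:end])]
            rest = word[end:]
            while rest:
                for send in range(len(rest), 0, -1):
                    if rest[:send] in SUFFIXES:
                        tokens.append(("SUFFIX", rest[:send]))
                        rest = rest[send:]
                        break
                else:
                    tokens.append(("BPE", rest))
                    rest = ""
            return tokens
    return [("BPE", word)]

print(segment("geliyorum"))  # [('ROOT', 'gel'), ('SUFFIX', 'iyor'), ('SUFFIX', 'um')]
```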

Examples

Try these example texts:

  • "Merhaba Dünya! Bu bir gelişmiş Türkçe tokenizer testidir."
  • "İstanbul'da yaşıyorum ve Türkçe dilini öğreniyorum."
  • "Kitap okumak çok güzeldir ve bilgi verir."
  • "Türkiye Cumhuriyeti'nin başkenti Ankara'dır."
  • "Yapay zeka ve makine öğrenmesi teknolojileri gelişiyor."

Technical Details

This tokenizer uses:

  • Custom morphological analysis for Turkish
  • JSON-based vocabulary files
  • Gradio for the web interface
  • A hybrid root/suffix lookup with BPE fallback for out-of-vocabulary subwords
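A JSON vocabulary file of the kind listed above typically maps tokens to integer IDs. The file name and layout below are assumptions for illustration (an in-memory stand-in is used instead of a real file):

```python
# Hypothetical sketch: loading a JSON vocabulary into id<->token maps.
# The layout {token: id} and the file name "vocab.json" are assumptions.
import io
import json

vocab_file = io.StringIO('{"gel": 0, "iyor": 1, "um": 2}')  # stand-in for open("vocab.json")
token_to_id = json.load(vocab_file)
id_to_token = {i: t for t, i in token_to_id.items()}
print(id_to_token[1])  # iyor
```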

Research Paper

This implementation is based on the research paper:

"Tokens with Meaning: A Hybrid Tokenization Approach for NLP"

📄 arXiv:2508.14292

Authors: M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik

Abstract: A hybrid tokenization framework combining rule-based morphological analysis with statistical subword segmentation improves tokenization for morphologically rich languages like Turkish. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
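The phonological normalization mentioned in the abstract can be sketched with Turkish consonant alternation: a softened final consonant in a surface form (e.g. "kitabı") is mapped back to the hard dictionary root ("kitap") before lookup. The mapping table below is a simplified assumption, not the paper's full rule set:

```python
# Toy illustration of phonological normalization via consonant alternation.
# SOFTENING covers only the common b/c/d/ğ alternations; real rules are richer.
SOFTENING = {"b": "p", "c": "ç", "d": "t", "ğ": "k"}

def candidate_roots(surface: str):
    """Yield the surface form and, if it ends in a softened consonant, the hard form."""
    yield surface
    if surface and surface[-1] in SOFTENING:
        yield surface[:-1] + SOFTENING[surface[-1]]

print(list(candidate_roots("kitab")))  # ['kitab', 'kitap']
```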

Please cite this paper if you use this tokenizer in your research:

@article{bayram2025tokens,
  title={Tokens with Meaning: A Hybrid Tokenization Approach for NLP},
  author={Bayram, M. Ali and Fincan, Ali Arda and Gümüş, Ahmet Semih and Karakaş, Sercan and Diri, Banu and Yıldırım, Savaş and Çelik, Demircan},
  journal={arXiv preprint arXiv:2508.14292},
  year={2025},
  url={https://arxiv.org/abs/2508.14292}
}

Files

  • app.py: Main Gradio application
  • requirements.txt: Python dependencies

Local Development

To run locally:

pip install -r requirements.txt
python app.py

The app will be available at http://localhost:7860

Dependencies

  • gradio: Web interface framework
  • turkish-tokenizer: Core tokenization library

License

This project is available under the CC BY-NC-ND 4.0 License (see the license field in the metadata above).