Commit 881f336 by alibayram · 1 Parent(s): d809532

Enhance README with a new section detailing the research paper on the hybrid tokenization approach, including citation information and authors. Update requirements to specify the version of the Turkish Tokenizer package.

Files changed (2)
  1. README.md +24 -0
  2. requirements.txt +1 -1
README.md CHANGED
@@ -55,6 +55,30 @@ This tokenizer uses:
  - Gradio for the web interface
  - Advanced tokenization algorithms

+ ## Research Paper
+
+ This implementation is based on the research paper:
+
+ **"Tokens with Meaning: A Hybrid Tokenization Approach for NLP"**
+
+ 📄 [arXiv:2508.14292](https://arxiv.org/abs/2508.14292)
+
+ **Authors:** M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, Demircan Çelik
+
+ **Abstract:** A hybrid tokenization framework combining rule-based morphological analysis with statistical subword segmentation improves tokenization for morphologically rich languages like Turkish. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
+
+ Please cite this paper if you use this tokenizer in your research:
+
+ ```bibtex
+ @article{bayram2024tokens,
+ title={Tokens with Meaning: A Hybrid Tokenization Approach for NLP},
+ author={Bayram, M. Ali and Fincan, Ali Arda and Gümüş, Ahmet Semih and Karakaş, Sercan and Diri, Banu and Yıldırım, Savaş and Çelik, Demircan},
+ journal={arXiv preprint arXiv:2508.14292},
+ year={2025},
+ url={https://arxiv.org/abs/2508.14292}
+ }
+ ```
+
  ## Files

  - `app.py`: Main Gradio application
requirements.txt CHANGED
@@ -1,2 +1,2 @@
  gradio
- turkish-tokenizer
+ turkish-tokenizer==0.2.24
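
The change to `requirements.txt` pins the tokenizer to a single release. The following is a minimal sketch, not part of this commit, of how a deployment might check that the installed package matches such a pin. `parse_pin` and `check_pin` are hypothetical helper names introduced here for illustration, and only exact `==` pins are handled.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pin(line):
    """Split a requirements line into (name, pinned_version or None).

    Only exact `==` pins are recognised; other specifiers (>=, ~=),
    extras, and environment markers are outside this sketch.
    """
    spec = line.split("#", 1)[0].strip()  # drop trailing comments
    if "==" in spec:
        name, pinned = spec.split("==", 1)
        return name.strip(), pinned.strip()
    return spec, None


def check_pin(line):
    """Return True if the installed package satisfies the requirement line."""
    name, pinned = parse_pin(line)
    try:
        installed = version(name)
    except PackageNotFoundError:
        return False  # package is not installed at all
    return pinned is None or installed == pinned


if __name__ == "__main__":
    # The two lines from requirements.txt after this commit.
    for req in ("gradio", "turkish-tokenizer==0.2.24"):
        status = "ok" if check_pin(req) else "missing or version mismatch"
        print(f"{req} -> {status}")
```

An unpinned line such as `gradio` passes whenever the package is installed, while a pinned line passes only on an exact version match.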