---
license: apache-2.0
language:
- tr
tags:
- turkish
---

# Turkish WordPiece Tokenizer

This repository contains a **WordPiece tokenizer** trained on **1 billion Turkish sentences**, making it well suited for natural language processing (NLP) tasks in Turkish. The tokenizer was built with the `tokenizers` library and ships in both cased and uncased versions for flexibility.

## Repository Structure

| File Name | Description |
|-----------------------------------------|--------------------------------------------------------------------------------------------------|
| `special_tokens_map.json` | Maps special tokens such as `[UNK]`, `[PAD]`, `[CLS]`, and `[SEP]` to their respective identifiers. |
| `tokenizer_config.json` | Configuration details for the tokenizer, including model type and special-token settings. |
| `turkish_wordpiece_tokenizer.json` | The primary WordPiece tokenizer trained on 1 billion Turkish sentences (cased). |
| `turkish_wordpiece_tokenizer_uncased.json` | The uncased version of the WordPiece tokenizer. |
| `turkish_wordpiece_tokenizer_post_token_uncased.json` | The post-tokenization configuration for the uncased tokenizer. |

## Features

- **WordPiece Tokenization**: Breaks words into subword units for better handling of rare or unseen words.
- **Cased and Uncased Variants**: Separate tokenizers for preserving case sensitivity and for ignoring case.
- **Optimized for Turkish**: Trained on a large-scale Turkish dataset (1 billion sentences), giving strong coverage of Turkish vocabulary and morphology.
- **Special Tokens**: Includes commonly used tokens such as:
  - `[UNK]` (unknown token)
  - `[PAD]` (padding token)
  - `[CLS]` (classification token)
  - `[SEP]` (separator token)

## Usage

The tokenizer can be loaded with either the Hugging Face `transformers` library or the `tokenizers` library. A sketch of loading it with `transformers` is included at the end of this README.

### Loading with `tokenizers`:

```python
from tokenizers import Tokenizer

# Load the uncased tokenizer
tokenizer = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

# Tokenize a sentence
output = tokenizer.encode("Merhaba dünya!")
print(output.tokens)
```

## Tokenizer Training Details

- **Dataset**: 1 billion Turkish sentences drawn from diverse domains (news, social media, literature, etc.).
- **Model**: WordPiece tokenizer trained with a vocabulary size chosen for Turkish.
- **Uncased Variant**: Lowercases all text during tokenization so that case distinctions are ignored.

## Applications

- **Text Classification**
- **Machine Translation**
- **Question Answering**
- **Text Summarization**
- **Named Entity Recognition (NER)**

## Citation

If you use this tokenizer in your research or applications, please cite it as follows:

```
@misc{turkish_wordpiece_tokenizer,
  title={Turkish WordPiece Tokenizer},
  author={Mert Cobanov},
  year={2024},
  url={https://huggingface.co/mertcobanov/turkish-wordpiece-tokenizer}
}
```

## Contributions

Contributions are welcome! If you have suggestions or improvements, please open an issue or submit a pull request.
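
## Loading with `transformers` (Sketch)

The snippet below is a minimal sketch of wrapping the cased tokenizer file in `PreTrainedTokenizerFast` from `transformers`. The file path is a placeholder, and the special-token names are assumed to match those listed in `special_tokens_map.json`; whether `tokenizer_config.json` already configures these for `AutoTokenizer.from_pretrained` has not been verified here, so they are passed explicitly.

```python
from transformers import PreTrainedTokenizerFast

# Wrap the raw `tokenizers` JSON file in a fast tokenizer object.
# The path is a placeholder, and the special-token names are an
# assumption based on the repository description above.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="path/to/turkish_wordpiece_tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
)

# Tokenize a sentence and inspect the subword tokens and their IDs.
encoding = tokenizer("Merhaba dünya!")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding["input_ids"])
```

Wrapping the file this way lets the tokenizer plug directly into `transformers` pipelines (padding, truncation, batch encoding) without any additional conversion step.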