---
license: apache-2.0
language:
- tr
tags:
- turkish
---

# Turkish WordPiece Tokenizer

This repository contains a **WordPiece tokenizer** trained on **1 billion Turkish sentences**, making it well suited for natural language processing (NLP) tasks in Turkish. The tokenizer was built with the `tokenizers` library and ships in both cased and uncased versions for flexibility.

## Repository Structure

| File Name | Description |
|-----------------------------------------|--------------------------------------------------------------------------------------------------|
| `special_tokens_map.json` | Maps special tokens such as `[UNK]`, `[PAD]`, `[CLS]`, and `[SEP]` to their respective identifiers. |
| `tokenizer_config.json` | Configuration details for the tokenizer, including model type and special-token settings. |
| `turkish_wordpiece_tokenizer.json` | The primary WordPiece tokenizer trained on 1 billion Turkish sentences (cased). |
| `turkish_wordpiece_tokenizer_uncased.json` | The uncased version of the WordPiece tokenizer. |
| `turkish_wordpiece_tokenizer_post_token_uncased.json` | The post-tokenization configuration for the uncased tokenizer. |

## Features

- **WordPiece Tokenization**: Breaks words into subword units for better handling of rare or unseen words.
- **Cased and Uncased Variants**: Separate tokenizers for preserving case sensitivity and for ignoring case.
- **Optimized for Turkish**: Trained on a large-scale Turkish dataset (1 billion sentences), giving strong coverage of Turkish vocabulary and morphology.
- **Special Tokens**: Includes commonly used tokens such as:
  - `[UNK]` (unknown token)
  - `[PAD]` (padding token)
  - `[CLS]` (classification token)
  - `[SEP]` (separator token)

## Usage

The tokenizer can be loaded with either the Hugging Face `transformers` library or the `tokenizers` library. A sketch of loading it with `transformers` is included at the end of this README.

### Loading with `tokenizers`:

```python
from tokenizers import Tokenizer

# Load the uncased tokenizer
tokenizer = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

# Tokenize a sentence
output = tokenizer.encode("Merhaba dünya!")
print(output.tokens)
```

## Tokenizer Training Details

- **Dataset**: 1 billion Turkish sentences drawn from diverse domains (news, social media, literature, etc.).
- **Model**: WordPiece tokenizer trained with a vocabulary size chosen for Turkish.
- **Uncased Variant**: Lowercases all text during tokenization so that case distinctions are ignored.

## Applications

- **Text Classification**
- **Machine Translation**
- **Question Answering**
- **Text Summarization**
- **Named Entity Recognition (NER)**

## Citation

If you use this tokenizer in your research or applications, please cite it as follows:

```
@misc{turkish_wordpiece_tokenizer,
  title={Turkish WordPiece Tokenizer},
  author={Mert Cobanov},
  year={2024},
  url={https://huggingface.co/mertcobanov/turkish-wordpiece-tokenizer}
}
```

## Contributions

Contributions are welcome! If you have suggestions or improvements, please open an issue or submit a pull request.
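
## Loading with `transformers` (Sketch)

The snippet below is a minimal sketch of wrapping the cased tokenizer file in `PreTrainedTokenizerFast` from `transformers`. The file path is a placeholder, and the special-token names are assumed to match those listed in `special_tokens_map.json`; whether `tokenizer_config.json` already configures these for `AutoTokenizer.from_pretrained` has not been verified here, so they are passed explicitly.

```python
from transformers import PreTrainedTokenizerFast

# Wrap the raw `tokenizers` JSON file in a fast tokenizer object.
# The path is a placeholder, and the special-token names are an
# assumption based on the repository description above.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="path/to/turkish_wordpiece_tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
)

# Tokenize a sentence and inspect the subword tokens and their IDs.
encoding = tokenizer("Merhaba dünya!")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding["input_ids"])
```

Wrapping the file this way lets the tokenizer plug directly into `transformers` pipelines (padding, truncation, batch encoding) without any additional conversion step.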