# Dutch-Llama Tokenizer

## Overview

The Dutch-Llama Tokenizer is a versatile tokenizer trained to handle a variety of languages and formats, including Dutch, English, Python code, Markdown, and general text. It was trained on a dataset drawn from diverse sources, which enables it to tokenize a wide range of text inputs effectively.
## Dataset Composition
The tokenizer was trained on a comprehensive dataset, including:
- MC4 Dutch and English texts (195M)
- English and Dutch Wikipedia (278M and 356M, respectively)
- Dutch and English book datasets (211M and 355M, respectively)
- Dutch news articles (256M)
- CodeParrot GitHub Python code (158M)
- CodeSearchNet Python code (126M)
- Markdown files with math markup (5.8M)
- arXiv scientific papers (169M)
## Tokenizer Settings

The tokenizer was trained with the `spm_train` command using the following settings (an equivalent invocation is sketched after the list):
- Model Type: Byte Pair Encoding (BPE)
- Vocab Size: 32,000
- Character Coverage: 100%
- Support for splitting digits and whitespace-only pieces
- Optimized for large corpus training
- Byte fallback enabled, with accepted languages Dutch (nl) and English (en)
- Special tokens and IDs for unknown, beginning of sentence, end of sentence, padding, and custom user-defined symbols
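The settings above map directly onto SentencePiece's training options. The sketch below shows what an equivalent run might look like via the SentencePiece Python API; the input file name, model prefix, and special-token IDs are placeholder assumptions, not the exact command used for this release.

```python
import sentencepiece as spm

# Sketch of a SentencePiece BPE training run mirroring the settings above.
# "corpus.txt", "dutch_llama", and the special-token IDs are placeholder
# assumptions; the exact invocation for this release is not published here.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                 # training text, one sentence per line
    model_prefix="dutch_llama",         # writes dutch_llama.model / .vocab
    model_type="bpe",                   # Byte Pair Encoding
    vocab_size=32000,
    character_coverage=1.0,             # 100% character coverage
    split_digits=True,                  # split digits into single-digit pieces
    allow_whitespace_only_pieces=True,  # keep whitespace-only pieces
    train_extremely_large_corpus=True,  # optimized for large corpus training
    byte_fallback=True,                 # decompose unknown characters into bytes
    accept_language="nl,en",            # accepted languages: Dutch and English
    unk_id=0, bos_id=1, eos_id=2,       # assumed Llama-style special-token IDs
    pad_id=3,                           # assumed padding-token ID
    # user_defined_symbols=[...]        # custom symbols omitted; not published here
)
```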
## Installation

To use the Dutch-Llama Tokenizer, ensure you have Python 3.10.12 or later installed. Then, install the Transformers library from Hugging Face:

```bash
pip install transformers
```
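Depending on your environment, Transformers may fall back to the slow, SentencePiece-based tokenizer, which additionally requires the `sentencepiece` package (`pip install sentencepiece`); this is general Transformers behavior for Llama-style tokenizers rather than something specific to this model.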
## Usage

First, import `AutoTokenizer` from the Transformers library and load the Dutch-Llama Tokenizer:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/dutch-llama-tokenizer")
```
To tokenize text, use the `tokenizer.tokenize` method. To convert tokens to IDs and decode them back to text, use `tokenizer.convert_tokens_to_ids` and `tokenizer.decode`, respectively:
```python
# Example text: mixed Dutch and Chinese
text = "Steenvliegen of oevervliegen[2] (Plecoptera) 华为发布Mate60手机"

# Tokenization and decoding round-trip
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
decoded_text = tokenizer.decode(token_ids)
print(decoded_text)
```
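Because byte fallback is enabled, characters not covered by the 32k vocabulary should be decomposed into byte pieces such as `<0xE5>` rather than mapped to the unknown token. A quick way to see this, assuming the Chinese characters above are indeed out of vocabulary, is to print the individual tokens:

```python
# With byte fallback, out-of-vocabulary characters appear as byte
# pieces named <0xNN> instead of collapsing to the <unk> token.
for token in tokenizer.tokenize("华为发布Mate60手机"):
    print(token)
```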
## Dutch Tokenizer Arena
Compare the effectiveness of this tokenizer on different inputs at the Hugging Face Space: Dutch Tokenizer Arena.
## Comparison with Other Tokenizers
The following table shows the number of tokens produced by the Dutch-Llama Tokenizer, the Mistral Tokenizer, the GroNLP GPT-2 Dutch Tokenizer, and the UL2 Dutch Tokenizer on a variety of inputs.
| Input Type   | Dutch-Llama (32k) | Mistral (32k) | GroNLP GPT-2 Dutch (40k) | UL2 Dutch (32k)   |
|--------------|-------------------|---------------|--------------------------|-------------------|
| Dutch news   | 440               | 658           | 408                      | 410               |
| English news | 414               | 404           | 565                      | 402               |
| Python code  | 566               | 582           | 767                      | 639 (no newlines) |
| LaTeX math   | 491               | 497           | 717                      | 666 (no newlines) |
| Total        | 1911              | 2141          | 2457                     | 2117              |

("No newlines" indicates that the UL2 Dutch tokenizer does not preserve newline characters, so its counts on the code and math inputs are not directly comparable.)
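A comparison along these lines can be reproduced by counting the tokens each tokenizer produces for the same inputs. The sketch below is illustrative only: the sample texts are stand-ins for the actual evaluation inputs, and the repository ID for the GroNLP tokenizer is an assumption.

```python
from transformers import AutoTokenizer

# Placeholder inputs; the actual evaluation texts are not published here.
samples = {
    "Dutch news": "De Tweede Kamer stemde dinsdag in met het wetsvoorstel.",
    "Python code": "def add(a, b):\n    return a + b\n",
}

# Repository IDs other than the first are assumptions for illustration.
for repo_id in ["yhavinga/dutch-llama-tokenizer", "GroNLP/gpt2-small-dutch"]:
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    counts = {name: len(tokenizer.tokenize(text)) for name, text in samples.items()}
    print(repo_id, counts)
```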
🇳🇱 🇧🇪