# Dutch-Llama Tokenizer

## Overview

The Dutch-Llama Tokenizer is a versatile tokenizer trained to handle a variety of languages and formats, including Dutch, English, Python code, Markdown, and general text. It was trained on a dataset drawn from diverse sources, which enables it to tokenize a wide range of text inputs effectively.
## Dataset Composition
The tokenizer was trained on a comprehensive dataset, including:
- MC4 Dutch and English texts (195M)
- English and Dutch Wikipedia (278M and 356M, respectively)
- Dutch and English book datasets (211M and 355M, respectively)
- Dutch news articles (256M)
- CodeParrot GitHub Python code (158M)
- CodeSearchNet Python code (126M)
- Markdown files with math markup (5.8M)
- arXiv scientific papers (169M)
## Tokenizer Settings

The tokenizer was trained with the `spm_train` command using the following settings (an equivalent invocation is sketched after the list):
- Model Type: Byte Pair Encoding (BPE)
- Vocab Size: 32,000
- Character Coverage: 100%
- Support for splitting digits and whitespace-only pieces
- Optimized for large corpus training
- Byte fallback enabled, with accepted languages Dutch (nl) and English (en)
- Special tokens and IDs for unknown, beginning of sentence, end of sentence, padding, and custom user-defined symbols
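The settings above map directly onto SentencePiece's training options. The sketch below shows what an equivalent run might look like via the SentencePiece Python API; the input file name, model prefix, and special-token IDs are placeholder assumptions, not the exact command used for this release.

```python
import sentencepiece as spm

# Sketch of a SentencePiece BPE training run mirroring the settings above.
# "corpus.txt", "dutch_llama", and the special-token IDs are placeholder
# assumptions; the exact invocation for this release is not published here.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                 # training text, one sentence per line
    model_prefix="dutch_llama",         # writes dutch_llama.model / .vocab
    model_type="bpe",                   # Byte Pair Encoding
    vocab_size=32000,
    character_coverage=1.0,             # 100% character coverage
    split_digits=True,                  # split digits into single-digit pieces
    allow_whitespace_only_pieces=True,  # keep whitespace-only pieces
    train_extremely_large_corpus=True,  # optimized for large corpus training
    byte_fallback=True,                 # decompose unknown characters into bytes
    accept_language="nl,en",            # accepted languages: Dutch and English
    unk_id=0, bos_id=1, eos_id=2,       # assumed Llama-style special-token IDs
    pad_id=3,                           # assumed padding-token ID
    # user_defined_symbols=[...]        # custom symbols omitted; not published here
)
```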
## Installation

To use the Dutch-Llama Tokenizer, ensure you have Python 3.10.12 or later installed. Then, install the Transformers library from Hugging Face:

```bash
pip install transformers
```
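Depending on your environment, Transformers may fall back to the slow, SentencePiece-based tokenizer, which additionally requires the `sentencepiece` package (`pip install sentencepiece`); this is general Transformers behavior for Llama-style tokenizers rather than something specific to this model.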
## Usage

First, import `AutoTokenizer` from the Transformers library and load the Dutch-Llama Tokenizer:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/dutch-llama-tokenizer")
```
To tokenize text, use the `tokenizer.tokenize` method. To convert tokens to IDs and decode them back to text, use `tokenizer.convert_tokens_to_ids` and `tokenizer.decode`, respectively:
```python
# Example text: mixed Dutch and Chinese
text = "Steenvliegen of oevervliegen[2] (Plecoptera) 华为发布Mate60手机"

# Tokenization and decoding round-trip
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
decoded_text = tokenizer.decode(token_ids)
print(decoded_text)
```
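Because byte fallback is enabled, characters not covered by the 32k vocabulary should be decomposed into byte pieces such as `<0xE5>` rather than mapped to the unknown token. A quick way to see this, assuming the Chinese characters above are indeed out of vocabulary, is to print the individual tokens:

```python
# With byte fallback, out-of-vocabulary characters appear as byte
# pieces named <0xNN> instead of collapsing to the <unk> token.
for token in tokenizer.tokenize("华为发布Mate60手机"):
    print(token)
```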
## Dutch Tokenizer Arena
Compare the effectiveness of this tokenizer on different inputs at the Hugging Face Space: Dutch Tokenizer Arena.
## Comparison with Other Tokenizers
The following table shows the number of tokens produced by the Dutch-Llama Tokenizer, the Mistral Tokenizer, the GroNLP GPT-2 Dutch Tokenizer, and the UL2 Dutch Tokenizer on a variety of inputs.
| Input Type   | Dutch-Llama (32k) | Mistral (32k) | GroNLP GPT-2 Dutch (40k) | UL2 Dutch (32k)   |
|--------------|-------------------|---------------|--------------------------|-------------------|
| Dutch news   | 440               | 658           | 408                      | 410               |
| English news | 414               | 404           | 565                      | 402               |
| Python code  | 566               | 582           | 767                      | 639 (no newlines) |
| LaTeX math   | 491               | 497           | 717                      | 666 (no newlines) |
| Total        | 1911              | 2141          | 2457                     | 2117              |

("No newlines" indicates that the UL2 Dutch tokenizer does not preserve newline characters, so its counts on the code and math inputs are not directly comparable.)
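A comparison along these lines can be reproduced by counting the tokens each tokenizer produces for the same inputs. The sketch below is illustrative only: the sample texts are stand-ins for the actual evaluation inputs, and the repository ID for the GroNLP tokenizer is an assumption.

```python
from transformers import AutoTokenizer

# Placeholder inputs; the actual evaluation texts are not published here.
samples = {
    "Dutch news": "De Tweede Kamer stemde dinsdag in met het wetsvoorstel.",
    "Python code": "def add(a, b):\n    return a + b\n",
}

# Repository IDs other than the first are assumptions for illustration.
for repo_id in ["yhavinga/dutch-llama-tokenizer", "GroNLP/gpt2-small-dutch"]:
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    counts = {name: len(tokenizer.tokenize(text)) for name, text in samples.items()}
    print(repo_id, counts)
```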
🇳🇱 🇧🇪