Dutch-Llama Tokenizer

Overview

The Dutch-Llama Tokenizer is a versatile tokenizer trained to handle a variety of languages and formats, including Dutch, English, Python code, Markdown, and general text. Because its training data draws on diverse sources, it tokenizes a wide range of text inputs effectively.

Dataset Composition

The tokenizer was trained on a comprehensive dataset, including:

  • MC4 Dutch and English texts (195M)
  • English and Dutch Wikipedia (278M and 356M, respectively)
  • Dutch and English book datasets (211M and 355M, respectively)
  • Dutch news articles (256M)
  • CodeParrot GitHub Python code (158M)
  • CodeSearchNet Python code (126M)
  • Markdown files with math markup (5.8M)
  • arXiv scientific papers (169M)

Tokenizer Settings

The tokenizer was trained using the spm_train command with the following settings (an illustrative training call is sketched after the list):

  • Model Type: Byte Pair Encoding (BPE)
  • Vocab Size: 32,000
  • Character Coverage: 100%
  • Support for splitting digits and whitespace-only pieces
  • Optimized for large corpus training
  • Byte fallback enabled, with accepted languages Dutch (nl) and English (en)
  • Special tokens and IDs for unknown, beginning of sentence, end of sentence, padding, and custom user-defined symbols
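
For reference, training with equivalent settings via the sentencepiece Python API might look like the sketch below. This is a reconstruction from the list above, not the exact command used; the input path, model prefix, special-token IDs, and user-defined symbols are assumptions.

import sentencepiece as spm

# A minimal sketch of the training call; paths, IDs, and symbols are assumed.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                      # assumed path to the combined training corpus
    model_prefix="dutch-llama",              # assumed output name
    model_type="bpe",                        # Byte Pair Encoding
    vocab_size=32000,
    character_coverage=1.0,                  # 100% character coverage
    split_digits=True,                       # split digits into individual tokens
    allow_whitespace_only_pieces=True,       # allow whitespace-only pieces
    train_extremely_large_corpus=True,       # optimized for large corpus training
    byte_fallback=True,                      # fall back to raw bytes for unseen characters
    accept_language="nl,en",                 # accepted languages: Dutch and English
    unk_id=0, bos_id=1, eos_id=2, pad_id=3,  # assumed special-token ID assignment
    user_defined_symbols=["<mask>"],         # placeholder; the actual symbols are not listed
)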

Installation

To use the Dutch-Llama Tokenizer, ensure you have Python 3.10.12 or later installed. Then, install the Transformers library from Hugging Face:

pip install transformers

Usage

First, import the AutoTokenizer from the Transformers library and load the Dutch-Llama Tokenizer:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("yhavinga/dutch-llama-tokenizer")

To tokenize text, use the tokenizer.tokenize method. To convert tokens to IDs and decode them back to text, use tokenizer.convert_tokens_to_ids and tokenizer.decode, respectively:

# Example text mixing Dutch and Chinese to exercise byte fallback
text = "Steenvliegen of oevervliegen[2] (Plecoptera) ๅŽไธบๅ‘ๅธƒMate60ๆ‰‹ๆœบ"

# Tokenize, map tokens to IDs, and decode back to text
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
decoded_text = tokenizer.decode(token_ids)

print(decoded_text)
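
The tokenizer can also be called directly, or via tokenizer.encode, to go from text to token IDs in one step; a short illustration:

# One-step alternatives to tokenize + convert_tokens_to_ids
ids = tokenizer.encode(text)    # text -> token IDs (adds special tokens by default)
batch = tokenizer(text)         # returns a dict with input_ids and attention_mask
print(tokenizer.decode(ids, skip_special_tokens=True))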

Dutch Tokenizer Arena

Compare the effectiveness of this tokenizer on different inputs at the Hugging Face Space: Dutch Tokenizer Arena.

Comparison with Other Tokenizers

The following table shows the number of tokens produced by the Dutch-Llama Tokenizer, the Mistral Tokenizer, the GroNLP GPT-2 Dutch Tokenizer, and the UL2 Dutch Tokenizer on a variety of inputs.

Input type      Dutch-Llama (32k)   Mistral (32k)   GroNLP GPT-2 Dutch (40k)   UL2 Dutch (32k)
Dutch news      440                 658             408                        410
English news    414                 404             565                        402
Python code     566                 582             767                        639 (no newlines)
LaTeX math      491                 497             717                        666 (no newlines)
Total           1911                2141            2457                       2117
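
Counts like these can be reproduced by encoding the same inputs with each tokenizer and comparing lengths. A minimal sketch, using short placeholder texts and an assumed second repository name; substitute the actual news, code, and math samples and whichever tokenizers you want to compare:

from transformers import AutoTokenizer

# Placeholder inputs; substitute the full news, code, and math samples.
texts = {
    "Dutch news": "Voorbeeld van een Nederlands nieuwsbericht.",
    "English news": "Example of an English news article.",
}

# Tokenizers to compare; the second repository name is an assumption.
for name in ["yhavinga/dutch-llama-tokenizer", "mistralai/Mistral-7B-v0.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    counts = {k: len(tok.encode(v, add_special_tokens=False)) for k, v in texts.items()}
    print(name, counts, "total:", sum(counts.values()))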

๐Ÿ‡ณ๐Ÿ‡ฑ ๐Ÿ‡ง๐Ÿ‡ช๐Ÿ๐Ÿ“
