Ayayay — Malyuk-powered Ukrainianization for the Aya-Expanse Tokenizer

Ayayay is the first tokenizer that makes Ukrainian the core language in a multilingual vocabulary — while retaining as much compatibility with the original Aya-Expanse tokenizer as possible through careful (partially manual) token remapping.

Feature Overview:

  1. +118,985 new Cyrillic BPE tokens from malyuk_qirim_tokenizer.json, trained on the full Malyuk Ukrainian corpus plus the Cyrillic slice of the Crimean Tatar corpus; only sub-words that appear ≥ 4,000 times are kept.
  2. Only the tail end of the Aya vocabulary (IDs > 150,000) and the 25K Cyrillic tokens already present in Aya were overwritten; the rest of the vocabulary is left intact.
  3. Unchanged tokens keep their IDs, enabling direct reuse of the Aya-Expanse embeddings (see the verification sketch after this list).
  4. Vocabulary size, special-token set, pre-/post-tokenisation logic, and output formatting match Aya-Expanse one-for-one.
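
A quick way to check points 2–4 in practice (a sketch, not part of the official repo: it counts how many tokens keep the same string-to-ID mapping and verifies that the special-token sets match):

from transformers import AutoTokenizer

aya    = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b")
ayayay = AutoTokenizer.from_pretrained("transhumanist-already-exists/ayayay-tokenizer")

aya_vocab = aya.get_vocab()  # token string -> id
# Tokens whose (string, id) pair is identical in both vocabularies can reuse Aya-Expanse embeddings directly.
shared = sum(1 for tok, idx in ayayay.get_vocab().items() if aya_vocab.get(tok) == idx)
print(f"tokens with identical string and ID: {shared} / {len(aya_vocab)}")
print(aya.all_special_tokens == ayayay.all_special_tokens)  # special-token sets should be equal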

Simple example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/ayayay-tokenizer"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids) # [123903, 175118, 167580, 196099] - only 4 tokens 💪🏻
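
For comparison, the same sentence can be passed through the original Aya-Expanse tokenizer; given its roughly 2.2 tokens per Ukrainian word in the metrics below, expect noticeably more than 4 tokens (the exact count is not quoted here):

base = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b")
base_ids = base("Всі красиві зберігають оптимізм", add_special_tokens=False).input_ids
print(len(base_ids))  # baseline token count for the same four-word sentence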

Metrics

Acknowledgement: evaluation results provided by @Sofetory.

Each cell shows the total token count with tokens-per-word in parentheses.

| tokenizer | lang-uk/malyuk, 100k texts | allenai/c4 (en), 100k texts | allenai/c4 (es, fr, it, de), 400k texts | QIRIM/crh_monocorpus (Cyrillic), 94 texts | allenai/c4 (ru), 100k texts | allenai/c4 (bg), 100k texts | allenai/c4 (be), 100k texts |
|---|---|---|---|---|---|---|---|
| word count | 22,898,164 | 36,170,971 | 198,173,216 | 1,868,259 | 42,557,519 | 44,627,199 | 43,153,645 |
| google/gemma-3-12b-it | 57,388,402 (2.506) | 47,285,432 (1.307) | 354,241,840 (1.788) | 6,240,944 (3.341) | 95,520,817 (2.245) | 103,950,626 (2.329) | 131,398,147 (3.045) |
| Qwen/Qwen3-8B | 84,408,084 (3.686) | 46,884,593 (1.296) | 395,581,536 (1.996) | 7,956,741 (4.259) | 116,115,062 (2.728) | 132,597,427 (2.971) | 173,571,099 (4.022) |
| meta-llama/Llama-3.1-8B-Instruct | 57,226,997 (2.499) | 46,085,724 (1.274) | 382,143,751 (1.928) | 7,386,873 (3.954) | 104,974,733 (2.467) | 119,123,733 (2.669) | 150,189,294 (3.480) |
| microsoft/Phi-4-mini-instruct | 59,447,036 (2.596) | 45,423,925 (1.256) | 335,188,687 (1.691) | 5,995,822 (3.209) | 91,824,464 (2.158) | 102,472,523 (2.296) | 119,587,038 (2.771) |
| CohereLabs/aya-expanse-8b | 50,973,632 (2.226) | 47,364,187 (1.309) | 353,221,932 (1.782) | 6,614,719 (3.541) | 93,089,697 (2.187) | 112,612,668 (2.523) | 141,262,943 (3.273) |
| ayayay-tokenizer (ours) | 37,094,157 (1.620) 🤩 | 48,288,882 (1.335) | 372,587,959 (1.880) | 4,238,587 (2.269) | 107,331,167 (2.522) | 114,292,191 (2.561) | 133,618,186 (3.096) |
Comments:

  - lang-uk/malyuk: a significant 27 % improvement over the Aya-Expanse baseline; the absolute leader in Ukrainian tokenization.
  - allenai/c4 (en, es, fr, it, de): tokens-per-word for English rises by less than 4 % compared with the baseline, so the Ayayay tokenizer retains strong multilingual capabilities.
  - QIRIM/crh_monocorpus (Cyrillic): a significant improvement over the original Aya and the other tokenizers.
  - allenai/c4 (ru): Russian efficiency drops, owing to the Ukrainian-centric changes, but still beats Qwen.
  - allenai/c4 (bg, be): other Cyrillic languages such as Bulgarian and Belarusian perform well after the token replacement; Belarusian improves especially noticeably.
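
The tokens-per-word figures are total tokens divided by total words per corpus; a minimal sketch of the computation (whitespace word splitting is an assumption here, the exact evaluation script may differ):

from transformers import AutoTokenizer

def tokens_per_word(tokenizer_name: str, texts: list[str]) -> float:
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    n_tokens = sum(len(tok(t, add_special_tokens=False).input_ids) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

sample = ["Всі красиві зберігають оптимізм"]  # replace with e.g. the 100k Malyuk texts
print(tokens_per_word("transhumanist-already-exists/ayayay-tokenizer", sample))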


Initialisation of embeddings for new tokens in Aya-Expanse models

Some tokens are identical to those in the original Aya-Expanse tokenizer. For the newly added tokens, you can initialise embeddings with tools such as FOCUS and ZeTT. The simplest, and often effective, alternative is to initialise the new embeddings randomly and train them with a warm-up schedule.
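
A minimal sketch of the random-initialisation route, assuming the Aya-Expanse checkpoint is available (FOCUS or ZeTT would replace the random step for the overwritten rows):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b")
new_tok  = AutoTokenizer.from_pretrained("transhumanist-already-exists/ayayay-tokenizer")
model    = AutoModelForCausalLM.from_pretrained("CohereLabs/aya-expanse-8b")

old_vocab = base_tok.get_vocab()                # token string -> id
emb = model.get_input_embeddings().weight.data  # [vocab_size, hidden_size]; size already matches
std = emb.std().item()                          # scale the random init to the existing embeddings

for token, idx in new_tok.get_vocab().items():
    if old_vocab.get(token) != idx:             # overwritten token -> no embedding to reuse
        emb[idx].normal_(mean=0.0, std=std)

Rows whose token kept both its string and its ID are left untouched, which is exactly the direct reuse described in the feature list; the re-initialised rows are then trained with a warm-up schedule.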

Citation

BibTeX:

@misc{zaduha2025post9164,
  author       = "{Bohdan Didenko}",
  title        = "{Post \#9164 on Telegram Channel Zaduha}",
  howpublished = "\url{https://t.me/zaduha/9164}",
  month        = jun,
  year         = {2025},
  note         = "[Online; accessed 8 June 2025]"
}