Ayayay — Malyuk-powered Ukrainianization for the Aya-Expanse Tokenizer

Ayayay is the first tokenizer that makes Ukrainian the core language in a multilingual vocabulary while retaining as much compatibility with the original Aya-Expanse tokenizer as possible through careful, partially manual token remapping.

Feature Overview:

+118,985 new Cyrillic BPE tokens from malyuk_qirim_tokenizer.json, which was trained on the full Malyuk Ukrainian corpus plus the Cyrillic slice of the Crimean Tatar corpus, keeping only sub-words that appear ≥ 4 000 times.
Only the tail end of the Aya vocab (IDs > 150 000) and the 25K Cyrillic tokens already present in Aya were overwritten; the rest of the vocabulary is intact.
Unchanged tokens preserve their IDs, enabling direct reuse of the Aya-Expanse embeddings.
Vocab size, special-token set, pre/post-tokenisation logic, and output formatting match Aya-Expanse one-for-one.
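
The ID-preservation property described above can be checked by comparing two vocabularies ID by ID. The following is a minimal sketch on toy data, not the procedure used to build Ayayay; the function name `split_vocab` and the sample vocabularies are hypothetical and exist only for illustration.

```python
def split_vocab(old_vocab: dict[str, int], new_vocab: dict[str, int]):
    """Partition the new vocabulary's IDs into unchanged and replaced ones."""
    # Invert token -> id into id -> token so entries can be compared per ID.
    old_by_id = {i: t for t, i in old_vocab.items()}
    new_by_id = {i: t for t, i in new_vocab.items()}
    unchanged = sorted(i for i, t in new_by_id.items() if old_by_id.get(i) == t)
    replaced = sorted(i for i, t in new_by_id.items() if old_by_id.get(i) != t)
    return unchanged, replaced


# Toy vocabularies (hypothetical, for illustration only).
old_vocab = {"the": 0, "cat": 1, "при": 2, "zzz": 3}
new_vocab = {"the": 0, "cat": 1, "привіт": 2, "оптимізм": 3}

unchanged, replaced = split_vocab(old_vocab, new_vocab)
print(unchanged)  # [0, 1]
print(replaced)   # [2, 3]
```

With Hugging Face tokenizers, the two dictionaries could be obtained from `get_vocab()` on each loaded tokenizer; embeddings at the unchanged IDs are the ones that can be reused directly.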

Simple example

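from transformers import AutoTokenizer

# Load the Ayayay tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/ayayay-tokenizer"
)
# Tokenize a Ukrainian sentence without adding special tokens
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids) # [123903, 175118, 167580, 196099] - only 4 tokens 💪🏻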

Metrics

Acknowledgement: evaluation results provided by @Sofetory.

	lang-uk/malyuk	100k texts	allenai/c4(en)	100k texts	allenai/c4(es, fr, it, de)	400k texts	QIRIM/crh_monocorpus(Cyrillic)	94 texts	allenai/c4(ru)	100k texts	allenai/c4(bg)	100k texts	allenai/c4(be)	100k texts
words count	22,898,164		36,170,971		198,173,216		1,868,259		42,557,519		44,627,199		43,153,645

tokenizers	tokens	toks/word	tokens	toks/word	tokens	toks/word	tokens	toks/word	tokens	toks/word	tokens	toks/word	tokens	toks/word
google/gemma-3-12b-it	57,388,402	2.506	47,285,432	1.307	354,241,840	1.788	6,240,944	3.341	95,520,817	2.245	103,950,626	2.329	131,398,147	3.045
Qwen/Qwen3-8B	84,408,084	3.686	46,884,593	1.296	395,581,536	1.996	7,956,741	4.259	116,115,062	2.728	132,597,427	2.971	173,571,099	4.022
meta-llama/Llama-3.1-8B-Instruct	57,226,997	2.499	46,085,724	1.274	382,143,751	1.928	7,386,873	3.954	104,974,733	2.467	119,123,733	2.669	150,189,294	3.48
microsoft/Phi-4-mini-instruct	59,447,036	2.596	45,423,925	1.256	335,188,687	1.691	5,995,822	3.209	91,824,464	2.158	102,472,523	2.296	119,587,038	2.771
CohereLabs/aya-expanse-8b	50,973,632	2.226	47,364,187	1.309	353,221,932	1.782	6,614,719	3.541	93,089,697	2.187	112,612,668	2.523	141,262,943	3.273
ayayay-tokenizer (Ours)	37,094,157	1.62🤩	48,288,882	1.335	372,587,959	1.88	4,238,587	2.269	107,331,167	2.522	114,292,191	2.561	133,618,186	3.096
Comments	Significant 27 % improvement over the Aya-Expanse baseline (2.226 to 1.62 tokens per word); absolute leader in Ukrainian tokenization.		Tokens-per-word for English rises by about 2 % compared with the baseline (1.309 to 1.335).		Tokens-per-word on Spanish, French, Italian and German rises by about 5 % (1.782 to 1.88), so the Ayayay tokenizer retains strong multilingual capabilities.		Shows significant improvement on QIRIM Cyrillic versus the original Aya-Expanse and the other tokenizers.		Russian efficiency drops, owing to the Ukrainian-centric changes, but still beats Qwen3-8B.		Bulgarian stays close to the Aya-Expanse baseline after the token replacement (2.523 to 2.561), while Belarusian improves noticeably (3.273 to 3.096).
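
The toks/word columns and the percentages quoted in the comments follow directly from the token and word counts in the table above. As a small worked example, the sketch below recomputes the Ukrainian and English figures for Aya-Expanse and Ayayay; the helper names are illustrative, and the numbers are copied from the table.

```python
def tokens_per_word(tokens: int, words: int) -> float:
    """Average number of tokens produced per word."""
    return tokens / words


def relative_change(new: float, baseline: float) -> float:
    """Signed relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


# lang-uk/malyuk column
uk_words = 22_898_164
aya_uk = tokens_per_word(50_973_632, uk_words)
ours_uk = tokens_per_word(37_094_157, uk_words)

# allenai/c4(en) column
en_words = 36_170_971
aya_en = tokens_per_word(47_364_187, en_words)
ours_en = tokens_per_word(48_288_882, en_words)

print(f"uk: {aya_uk:.3f} -> {ours_uk:.3f} ({relative_change(ours_uk, aya_uk):+.1%})")
# uk: 2.226 -> 1.620 (-27.2%)
print(f"en: {aya_en:.3f} -> {ours_en:.3f} ({relative_change(ours_en, aya_en):+.1%})")
# en: 1.309 -> 1.335 (+2.0%)
```

A lower tokens-per-word value means fewer tokens for the same text, so a negative relative change is an improvement.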

tokenizer.json: Byte-level tokenizer spec (vocab, merges, model settings).
tokenizer_utf8.json: Human-readable dump of UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.
malyuk_qirim_tokenizer.json: Aya-style tokenizer trained on the full Malyuk Ukrainian corpus plus Cyrillic QIRIM (100 : 1 ratio), with min_frequency = 4_000.
merge_info.json: Lists the replaced Aya token IDs and the IDs of the added Malyuk tokens in malyuk_qirim_tokenizer.
tokenizer_config.json: Configuration metadata.
special_tokens_map.json: Mapping of special tokens (same as in Aya-Expanse).

Initialisation of embeddings for new tokens in Aya-Expanse models

Tokens that are identical to those in the original Aya-Expanse tokenizer keep their IDs, so their embeddings can be reused as they are. For the newly added tokens, you can initialise embeddings with tools such as Focus and Zett. The simplest, and often effective, alternative is to initialise the new embeddings randomly and train them with a warm-up schedule.
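
The random-initialisation option can be illustrated without any framework. The sketch below is a toy example under stated assumptions, not a recipe from the Ayayay authors: the embedding table is a plain list of lists, the function name `reinit_rows` is hypothetical, and the standard deviation of 0.02 is an assumed default rather than a value given in this document.

```python
import random


def reinit_rows(embeddings, replaced_ids, std=0.02, seed=0):
    """Return a copy of `embeddings` with the rows in `replaced_ids`
    redrawn from a zero-mean Gaussian; all other rows are kept as-is."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    out = [row[:] for row in embeddings]  # copy, so the input is not mutated
    for token_id in replaced_ids:
        out[token_id] = [rng.gauss(0.0, std) for _ in range(dim)]
    return out


# Toy 4-token, 3-dimensional embedding table (hypothetical values).
old = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
new = reinit_rows(old, replaced_ids=[2, 3])

print(new[0] == old[0], new[1] == old[1])  # True True: unchanged rows are reused
print(len(new[2]), len(new[3]))            # 3 3: new rows keep the embedding width
```

In a real model the same idea would apply to the rows of the input embedding matrix (and the output projection, if it is not tied). The IDs to reset would come from merge_info.json, whose exact layout is not described here.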

Citation

BibTeX:

@misc{zaduha2025post9164,
  author       = "{Bohdan Didenko}",
  title        = "{Post \#9164 on Telegram Channel Zaduha}",
  howpublished = "\url{https://t.me/zaduha/9164}",
  month        = june,
  year         = {2025},
  note         = "[Online; accessed 8 June 2025]"
}
