transhumanist-already-exists/ayayay-tokenizer

transhumanist-already-exists committed 6 days ago

Commit 5cb49d2 (verified)

Parent: 66f2804

Update README.md

Files changed (1)

README.md CHANGED

@@ -46,7 +46,7 @@ pretty_name: “ayayay - ukrainized aya tokenizer”
 ## Feature Overview:
-1. +118,985 new Cyrillic BPE merge from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json) trained on full [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus). Keeping only sub-words that appear ≥ 4 000 times.
+1. +118,985 new Cyrillic BPE tokens from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json) trained on full [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus). Keeping only sub-words that appear ≥ 4 000 times.
 2. Just the tail end of the Aya vocab (IDs > 150 000) and the 25K Cyrillic tokens already present in Aya were overwritten, keeping the rest of the vocabulary intact.
 3. Unchanged tokens preserve their IDs, enabling direct reuse of Aya-Expanse embedding.
 4. Vocab size, Special-token set, pre/post-tokenisation logic, and output formatting match Aya-Expanse one-for-one.
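
Note on item 3: the ID-preservation claim could be illustrated roughly as follows. This is a minimal sketch on toy data, not the author's procedure. The helper names `preserved_ids` and `init_embeddings`, the toy vocabularies, and the random initialisation of overwritten rows are all assumptions made for illustration. It also assumes contiguous IDs starting at 0 and equal vocabulary sizes, as item 4 suggests.

```python
import random

def preserved_ids(old_vocab, new_vocab):
    """Return sorted IDs whose token string is identical in both vocabularies."""
    return sorted(
        idx for tok, idx in new_vocab.items() if old_vocab.get(tok) == idx
    )

def init_embeddings(old_emb, old_vocab, new_vocab, dim, seed=0):
    """Copy embedding rows for preserved IDs; re-initialise overwritten rows.

    Hypothetical helper: the Gaussian init (std 0.02) is an assumption,
    not something the README specifies.
    """
    rng = random.Random(seed)
    keep = set(preserved_ids(old_vocab, new_vocab))
    new_emb = []
    for idx in range(len(new_vocab)):
        if idx in keep:
            new_emb.append(list(old_emb[idx]))  # reuse the existing row
        else:
            new_emb.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return new_emb

# Toy data: IDs 0 and 1 are unchanged, IDs 2 and 3 were overwritten.
old_vocab = {"<s>": 0, "the": 1, "при": 2, "xyz": 3}
new_vocab = {"<s>": 0, "the": 1, "привіт": 2, "мова": 3}
old_emb = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

kept = preserved_ids(old_vocab, new_vocab)
new_emb = init_embeddings(old_emb, old_vocab, new_vocab, dim=2)
print(kept)
```

With real tokenizers the two dictionaries would come from each tokenizer's vocabulary mapping. Only the rows for overwritten IDs would then need training from scratch.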