Update README.md
README.md
CHANGED
@@ -46,7 +46,7 @@ pretty_name: “ayayay - ukrainized aya tokenizer”
 
 ## Feature Overview:
 
-1. +118,985 new Cyrillic BPE
+1. +118,985 new Cyrillic BPE tokens from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json) trained on full [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus). Keeping only sub-words that appear ≥ 4 000 times.
 2. Just the tail end of the Aya vocab (IDs > 150 000) and the 25K Cyrillic tokens already present in Aya were overwritten, keeping the rest of the vocabulary intact.
 3. Unchanged tokens preserve their IDs, enabling direct reuse of Aya-Expanse embedding.
 4. Vocab size, Special-token set, pre/post-tokenisation logic, and output formatting match Aya-Expanse one-for-one.
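
Items 1 to 4 of the overview read as a small vocabulary-surgery recipe: filter new sub-words by frequency, overwrite only the tail IDs and the existing Cyrillic tokens, and leave every other ID where it was. The sketch below illustrates that idea on toy data. It is not the repository's build script: the helper names (`is_cyrillic`, `filter_by_frequency`, `overwrite_vocab`), the toy vocabulary, and the thresholds are assumptions made for illustration, and tokens are treated as plain strings, whereas a real vocabulary may store them in a byte-mapped form and contains special tokens that must be left untouched.

```python
from collections import Counter


def is_cyrillic(token: str) -> bool:
    """True if the token contains at least one character in U+0400..U+04FF."""
    return any("\u0400" <= ch <= "\u04ff" for ch in token)


def filter_by_frequency(counts: Counter, min_count: int) -> list[str]:
    """Keep sub-words seen at least `min_count` times, most frequent first."""
    return [tok for tok, n in counts.most_common() if n >= min_count]


def overwrite_vocab(base_vocab: dict[str, int],
                    new_tokens: list[str],
                    tail_start: int) -> dict[str, int]:
    """Reuse tail IDs (> tail_start) and existing Cyrillic slots for new tokens.

    All other tokens keep their original IDs, so the vocabulary size is unchanged.
    """
    # IDs that may be overwritten: the tail of the vocab plus Cyrillic tokens.
    free_ids = sorted(
        i for tok, i in base_vocab.items()
        if i > tail_start or is_cyrillic(tok)
    )
    # Everything else is preserved as-is.
    kept = {
        tok: i for tok, i in base_vocab.items()
        if i <= tail_start and not is_cyrillic(tok)
    }
    candidates = [tok for tok in new_tokens if tok not in kept]
    if len(candidates) < len(free_ids):
        raise ValueError("not enough new tokens to fill every freed slot")

    merged = dict(kept)
    for slot, tok in zip(free_ids, candidates):
        merged[tok] = slot
    return merged


# Toy stand-ins (hypothetical): the real thresholds are 150 000 and 4 000.
base_vocab = {"the": 0, "ing": 1, "при": 2, "hello": 3, "xq": 4, "zz": 5}
counts = Counter({"при": 9000, "віт": 5000, "ння": 4000, "рідк": 12})

new_tokens = filter_by_frequency(counts, min_count=4000)
merged = overwrite_vocab(base_vocab, new_tokens, tail_start=3)

# Unchanged tokens keep their IDs, and the vocabulary size is preserved.
assert all(merged[t] == base_vocab[t] for t in ("the", "ing", "hello"))
assert len(merged) == len(base_vocab)
```

Because preserved tokens keep their IDs, the corresponding rows of an existing embedding matrix can be reused directly; only the rows at the overwritten IDs would need new values.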