transhumanist-already-exists commited on
Commit
5cb49d2
·
verified ·
1 Parent(s): 66f2804

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +1 -1
README.md CHANGED
@@ -46,7 +46,7 @@ pretty_name: “ayayay - ukrainized aya tokenizer”
46
 
47
  ## Feature Overview:
48
 
49
- 1. +118,985 new Cyrillic BPE merge from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json) trained on full [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus). Keeping only sub-words that appear ≥ 4 000 times.
50
  2. Just the tail end of the Aya vocab (IDs > 150 000) and the 25K Cyrillic tokens already present in Aya were overwritten, keeping the rest of the vocabulary intact.
51
  3. Unchanged tokens preserve their IDs, enabling direct reuse of Aya-Expanse embedding.
52
  4. Vocab size, Special-token set, pre/post-tokenisation logic, and output formatting match Aya-Expanse one-for-one.
 
46
 
47
  ## Feature Overview:
48
 
49
+ 1. +118,985 new Cyrillic BPE tokens from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json) trained on full [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus). Keeping only sub-words that appear ≥ 4 000 times.
50
  2. Just the tail end of the Aya vocab (IDs > 150 000) and the 25K Cyrillic tokens already present in Aya were overwritten, keeping the rest of the vocabulary intact.
51
  3. Unchanged tokens preserve their IDs, enabling direct reuse of Aya-Expanse embedding.
52
  4. Vocab size, Special-token set, pre/post-tokenisation logic, and output formatting match Aya-Expanse one-for-one.