Readme and tokeniser update

Files changed (3)

README.md +2 -3
src/data/generate_cyr_lat_pairs.py +1 -1
src/tokeniser/tokenizer.json +0 -0

README.md CHANGED

@@ -121,7 +121,7 @@ print(output)
 KazParC деректер жинағын жүктеп алу үшін сізге Hugging Face есептік жазбасы қажет екенін ескеріңіз. Бұған қоса, жүктеп алуды бастау үшін өзіңізді аутентификациялау үшін «huggingface-cli» орнатуыңыз қажет. Бұл туралы толығырақ [мына жерден](https://huggingface.co/docs/huggingface_hub/en/guides/cli) оқыңыз / Please note that you'll need a Hugging Face account to download the KazParC dataset. Additionally, you'll need to install `huggingface-cli` to authenticate yourself for the download to commence. Read more about it [here](https://huggingface.co/docs/huggingface_hub/en/guides/cli).
-Егер сіз Windows жүйесінде болсаңыз, «get_data.sh» сценарийі жұмыс істемеуі мүмкін. Дегенмен, файлдағы сілтемелерді орындап, ондағы қадамдарды қолмен орындау арқылы әлі де деректерді алуға болады. Сол сияқты, «generate_clean_corpus.sh» файлында да қате пайда болады, бұл «kazakh_latin_corpus.json» файлындағы бос немесе бос жолдарды сүзу, сондай-ақ оны араластыру үшін Windows жүйесінің баламалы мүмкіндігін табуды талап етеді. Бұған қоса, `wikiextractor` бумасын алдын ала орнатқаныңызға сенімді болыңыз (нақты пайдаланылған нұсқаны `requirements.txt` файлынан табуға болады) / If you're on Windows, the `get_data.sh` script likely won't work. However, you can still get the data by following the links in the file and manually doing the steps in there. Likewise, `generate_clean_corpus.sh` will also error out, requiring you to find an equivalent Windows functionality to filter out blank or empty lines in the `kazakh_latin_corpus.json` file, as well as shuffle it. Additionally, be sure to install the `wikiextractor` package beforehand (the exact version used can be found in the `requirements.txt` file).
 ---
@@ -150,5 +150,4 @@ KazParC деректер жинағын жүктеп алу үшін сізге
   year = 2025,
   url = {https://huggingface.co/crossroderick/dalat5}
 }
-```

 KazParC деректер жинағын жүктеп алу үшін сізге Hugging Face есептік жазбасы қажет екенін ескеріңіз. Бұған қоса, жүктеп алуды бастау үшін өзіңізді аутентификациялау үшін «huggingface-cli» орнатуыңыз қажет. Бұл туралы толығырақ [мына жерден](https://huggingface.co/docs/huggingface_hub/en/guides/cli) оқыңыз / Please note that you'll need a Hugging Face account to download the KazParC dataset. Additionally, you'll need to install `huggingface-cli` to authenticate yourself for the download to commence. Read more about it [here](https://huggingface.co/docs/huggingface_hub/en/guides/cli).
+Егер сіз Windows жүйесінде болсаңыз, `get_data.sh` сценарийі жұмыс істемеуі мүмкін. Дегенмен, файлдағы сілтемелерді орындап, ондағы қадамдарды қолмен орындау арқылы әлі де деректерді алуға болады. Сол сияқты, `generate_clean_corpus.sh` файлында да қате пайда болады, бұл `kazakh_latin_corpus.json` файлындағы бос немесе бос жолдарды сүзу, сондай-ақ оны араластыру үшін Windows жүйесінің баламалы мүмкіндігін табуды талап етеді. Бұған қоса, `wikiextractor` бумасын алдын ала орнатқаныңызға сенімді болыңыз (нақты пайдаланылған нұсқаны `requirements.txt` файлынан табуға болады) / If you're on Windows, the `get_data.sh` script likely won't work. However, you can still get the data by following the links in the file and manually doing the steps in there. Likewise, `generate_clean_corpus.sh` will also error out, requiring you to find an equivalent Windows functionality to filter out blank or empty lines in the `kazakh_latin_corpus.json` file, as well as shuffle it. Additionally, be sure to install the `wikiextractor` package beforehand (the exact version used can be found in the `requirements.txt` file).
 ---
   year = 2025,
   url = {https://huggingface.co/crossroderick/dalat5}
 }
+```
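
The Windows note in the README asks for an equivalent way to drop blank lines from `kazakh_latin_corpus.json` and shuffle it. A minimal cross-platform sketch in Python follows. It assumes the corpus holds one record per line, which the README wording suggests but the diff does not confirm. The function name `clean_and_shuffle`, its parameters, and the output file name are hypothetical and are not part of the repository, and the sketch is not a port of `generate_clean_corpus.sh`, whose contents are not shown here.

```python
import random
from pathlib import Path
from typing import Optional


def clean_and_shuffle(src: Path, dst: Path, seed: Optional[int] = None) -> int:
    """Copy src to dst without blank lines, in shuffled order.

    Hypothetical helper: assumes one record per line. Returns the
    number of lines written.
    """
    with src.open(encoding="utf-8") as f:
        # Keep only lines that contain something other than whitespace.
        lines = [line.rstrip("\r\n") for line in f if line.strip()]

    # A dedicated Random instance keeps the shuffle reproducible when a
    # seed is given, without touching the global random state.
    random.Random(seed).shuffle(lines)

    with dst.open("w", encoding="utf-8", newline="\n") as f:
        f.writelines(line + "\n" for line in lines)
    return len(lines)
```

A call such as `clean_and_shuffle(Path("kazakh_latin_corpus.json"), Path("kazakh_latin_corpus_clean.json"), seed=42)` would write the cleaned copy next to the original. The whole file is read into memory, which is simple but may not suit a very large corpus.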

src/data/generate_cyr_lat_pairs.py CHANGED

@@ -19,7 +19,7 @@ cyrillic_to_latin = {
     "Һ": "H", "һ": "h",
     "И": "I", "и": "i",      # used for [и], [й]
-    "І": "I", "і": "i",      # distinct from И in sound, both map to 'I/i'
     "Ж": "J", "ж": "j",
     "К": "K", "к": "k",

     "Һ": "H", "һ": "h",
     "И": "I", "и": "i",      # used for [и], [й]
+    "І": "I", "і": "ı",      # distinct from И in sound, both map to 'I/i'
     "Ж": "J", "ж": "j",
     "К": "K", "к": "k",
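
To illustrate what the one-character change does, the sketch below applies only the mapping entries visible in this hunk to a made-up string. The `transliterate` helper is hypothetical and stands in for whatever the script actually does with the table, which the diff does not show. With the new entry, lowercase `і` becomes dotless `ı` (U+0131) and no longer coincides with the output for `и`, while the uppercase forms of both letters still map to `I`.

```python
from typing import Mapping

# Only the entries visible in the diff hunk; the full table in
# generate_cyr_lat_pairs.py is larger and is not reproduced here.
cyrillic_to_latin = {
    "Һ": "H", "һ": "h",
    "И": "I", "и": "i",
    "І": "I", "і": "ı",  # lowercase is now dotless ı (U+0131)
    "Ж": "J", "ж": "j",
    "К": "K", "к": "k",
}


def transliterate(text: str, table: Mapping[str, str]) -> str:
    """Hypothetical helper: map each character, passing unknown ones through."""
    return "".join(table.get(ch, ch) for ch in text)


# Arbitrary test string (not a real word) built from the letters к, і, ж, и.
print(transliterate("кіжи", cyrillic_to_latin))  # kıji
```

Before the change, the same input would have produced `kiji`, so the two vowels were indistinguishable in lowercase Latin output.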

src/tokeniser/tokenizer.json CHANGED

The diff for this file is too large to render.