Commit · 1f2ae8d
1 Parent(s): 3cf1937
Slight readme update on the CC100 dataset part
README.md CHANGED
@@ -106,7 +106,7 @@ print(output)
 
 Тәуелсіз жоба болғанына қарамастан, DalaT5 өте маңызды үш деректер жиынтығын пайдаланады / Despite being an independent project, DalaT5 makes use of three very important datasets:
 
-- The first ~2 million records of the Kazakh subset of the CC100 dataset by [Conneau et al. (2020)](https://paperswithcode.com/paper/unsupervised-cross-lingual-representation-1)
+- The first ~2.2 million records of the Kazakh subset of the CC100 dataset by [Conneau et al. (2020)](https://paperswithcode.com/paper/unsupervised-cross-lingual-representation-1)
 - The raw, Kazakh-focused part of the [Kazakh Parallel Corpus (KazParC)](https://huggingface.co/datasets/issai/kazparc) from Nazarbayev University's Institute of Smart Systems and Artificial Intelligence (ISSAI), graciously made available on Hugging Face
 - The Wikipedia dump of articles in the Kazakh language, obtained via the `wikiextractor` Python package
 