---
base_model: t5-small
license: mit
language: kaz
tags:
  - text2text-generation
  - transliteration
  - kazakh
  - low-resource
  - cultural-nlp
  - t5
pipeline_tag: text2text-generation
widget:
  - text: "Cyrillic2Latin: Мен қазақ тілінде сөйлеймін."
model-index:
- name: DalaT5
  results:
    - task:
        name: Transliteration
        type: text2text-generation
      dataset:
        name: Kazakh Cyrillic–Latin Transliteration Corpus
        type: custom
      metrics:
        - name: Training Loss
          type: loss
          value: 0.6684
        - name: Evaluation Loss
          type: loss
          value: 0.0886
---
# DalaT5 - T5 Fine-Tuned on Cyrillic-to-Latin Kazakh 🇰🇿

> 'Dala' means 'steppe' in Kazakh - a nod to where the voice of this model might echo.

**DalaT5** is a fine-tuned version of `t5-small`, trained to **transliterate Kazakh text written in Cyrillic** into **Latin script**, following the alphabet from the officially adopted [2021 alphabet reform](https://astanatimes.com/2021/02/kazakhstan-presents-new-latin-alphabet-plans-gradual-transition-through-2031/).

Unlike language models that *generate* creatively, DalaT5 is trained as a **faithful transliterator** - preserving content while transforming form. It is also meant to serve as a **foundational model** to be improved upon as needed.

⚠️ Limitations
- May produce unexpected outputs for very short inputs or mixed-script text
- Accuracy may vary across dialects or uncommon characters

Architecturally, DalaT5 is mostly complete. Further updates will be released incrementally to improve generalisation and to make additional evaluation scripts and metrics available.

---

## 🧠 Purpose

This model wasn’t built for production-grade translation or for linguistic study alone.

It was born from something else:
- A deep **respect for Kazakh culture**
- A desire to let its **future alphabet speak**
- A belief that **languages deserve continuity** - even through code

> *Though I am not Kazakh by birth, I wanted Kazakh to have a voice among the languages of the future - in its new script, as a symbol of memory and continuity.*

---

## 🌍 Жоба туралы / About the Project

### 🏕 Қазақша

**DalaT5** - T5 моделінің негізінде жасалған тәжірибелік жоба. Ол **қазақ мәтінін кирилл жазуынан** **латын графикасына** аударады.

Бұл жоба:
- Ресми 2021 латын әліпбиіне негізделген  
- Қолдануға, дамытуға және шабыт алуға ашық  
- Шетел азаматының ниетпен жасаған еңбегі

> *Қазақ емеспін, бірақ осы тіл мені сезіндіріп отыр. Бұл модель - құрмет пен махаббаттың нәтижесі.*

---

### 🌐 English

**DalaT5** is a transformer fine-tuned on Kazakh Cyrillic-to-Latin data, designed to support Kazakhstan’s national script reform. The model focuses on script conversion, not translation, which makes it well suited to educational tools and linguistic preservation.

This project:
- Supports **underrepresented languages** in AI  
- Offers **open access** to the Latinised future of Kazakh  
- Was created by a foreigner - with humility, curiosity, and deep care

---

## 💻 Байқап көріңіз / Try it out

Hugging Face 🤗 Transformers арқылы тікелей пайдаланыңыз / Use directly via Hugging Face 🤗 Transformers:

```python
from transformers import pipeline

# Load the fine-tuned model from the Hugging Face Hub
pipe = pipeline("text2text-generation", model="crossroderick/dalat5")

# The "Cyrillic2Latin: " task prefix must precede the text to transliterate
text = "Мен қазақ тілінде сөйлеймін."
input_text = f"Cyrillic2Latin: {text}"
output = pipe(input_text, max_length=128)[0]["generated_text"]

print(output)
```
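
Since the model may behave unexpectedly on very short or mixed-script inputs, it can help to check the text before adding the task prefix. The following is a minimal sketch, not part of the DalaT5 repository: `script_counts` and `build_prompt` are hypothetical helper names, and the three-letter minimum is an arbitrary assumption you may want to tune.

```python
import unicodedata

PREFIX = "Cyrillic2Latin: "  # task prefix expected by the model


def script_counts(text: str) -> dict:
    """Count alphabetic characters per script, using Unicode character names."""
    counts = {"cyrillic": 0, "latin": 0, "other": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts


def build_prompt(text: str, min_letters: int = 3) -> str:
    """Return the prefixed model input, or raise ValueError for risky inputs.

    min_letters is an assumed threshold, not a value taken from the model card.
    """
    text = text.strip()
    counts = script_counts(text)
    if counts["cyrillic"] < min_letters:
        raise ValueError("input has too few Cyrillic letters")
    if counts["latin"] or counts["other"]:
        raise ValueError("input mixes Cyrillic with another script")
    return PREFIX + text
```

The string returned by `build_prompt(text)` can be passed to `pipe(...)` in place of the manually formatted `input_text`. Rejecting mixed-script text outright is a deliberately strict choice; a more lenient variant could only warn.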

---

## 🙏 Алғыс / Acknowledgements

Тәуелсіз жоба болғанына қарамастан, DalaT5 өте маңызды үш деректер жиынтығын пайдаланады / Despite being an independent project, DalaT5 makes use of three very important datasets:

- The first ~2.2 million records of the Kazakh subset of the CC100 dataset by [Conneau et al. (2020)](https://paperswithcode.com/paper/unsupervised-cross-lingual-representation-1)
- The raw, Kazakh-focused part of the [Kazakh Parallel Corpus (KazParC)](https://huggingface.co/datasets/issai/kazparc) from Nazarbayev University's Institute of Smart Systems and Artificial Intelligence (ISSAI), graciously made available on Hugging Face
- The Wikipedia dump of articles in the Kazakh language, obtained via the `wikiextractor` Python package

---

## 🤖 Нақты баптау нұсқаулары / Fine-tuning instructions

Деректер жиынының жалпы өлшемін ескере отырып, олар осы үлгінің репозиторийіне қосылмаған. Дегенмен, DalaT5-ті өзіңіз дәл баптағыңыз келсе, келесі әрекеттерді орындаңыз / Given the total size of the datasets, they haven't been included in this model's repository. However, should you wish to fine-tune DalaT5 yourself, please do the following:

1. `get_data.sh` қабық сценарий файлын "src/data" қалтасында іске қосыңыз / Run the `get_data.sh` shell script file in the "src/data" folder
2. Сол қалтадағы `generate_cyr_lat_pairs.py` файлын іске қосыңыз / Run the `generate_cyr_lat_pairs.py` file in the same folder 
3. Қазақ корпус файлын тазалау және деректер жинағын араластыру үшін `generate_clean_corpus.sh` іске қосыңыз / Run `generate_clean_corpus.sh` to clean the Kazakh corpus file and shuffle the dataset
4. Токенизаторды тазартылған корпусқа үйрету үшін `train_tokeniser.py` іске қосыңыз / Run `train_tokeniser.py` to train the tokeniser on the cleaned corpus

KazParC деректер жинағын жүктеп алу үшін сізге Hugging Face есептік жазбасы қажет екенін ескеріңіз. Бұған қоса, жүктеп алуды бастамас бұрын `huggingface-cli` орнатып, аутентификациядан өтуіңіз қажет. Бұл туралы толығырақ [мына жерден](https://huggingface.co/docs/huggingface_hub/en/guides/cli) оқыңыз / Please note that you'll need a Hugging Face account to download the KazParC dataset. You'll also need to install `huggingface-cli` and authenticate with it before the download can start. Read more about it [here](https://huggingface.co/docs/huggingface_hub/en/guides/cli).

Егер сіз Windows жүйесінде болсаңыз, `get_data.sh` сценарийі жұмыс істемеуі мүмкін. Дегенмен, файлдағы сілтемелерге өтіп, ондағы қадамдарды қолмен орындау арқылы әлі де деректерді алуға болады. Сол сияқты, `generate_clean_corpus.sh` файлында да қате пайда болады, бұл `kazakh_latin_corpus.json` файлындағы бос жолдарды сүзу, сондай-ақ оны араластыру үшін Windows жүйесінің баламалы мүмкіндігін табуды талап етеді. Бұған қоса, `wikiextractor` және `sentencepiece` бумаларын алдын ала орнатуды ұмытпаңыз (нақты нұсқаларды `requirements.txt` файлынан табуға болады) / If you're on Windows, the `get_data.sh` script likely won't work. However, you can still get the data by following the links in the file and performing its steps manually. Likewise, `generate_clean_corpus.sh` will fail, so you'll need a Windows equivalent to filter out blank lines in the `kazakh_latin_corpus.json` file and shuffle it. Additionally, be sure to install the `wikiextractor` and `sentencepiece` packages beforehand (the exact versions can be found in the `requirements.txt` file).
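
For Windows users, one platform-independent way to do the blank-line filtering and shuffling is a short Python script. This is a minimal sketch rather than a port of `generate_clean_corpus.sh`: it assumes the corpus file holds one record per line, it loads the whole file into memory, and the function name `clean_and_shuffle` and the fixed seed are hypothetical choices.

```python
import random


def clean_and_shuffle(src: str, dst: str, seed: int = 42) -> int:
    """Drop blank lines from src, shuffle the rest, and write them to dst.

    Assumes one record per line. Returns the number of lines kept.
    """
    with open(src, encoding="utf-8") as f:
        # Keep only lines that contain something other than whitespace
        kept = [line.rstrip("\r\n") for line in f if line.strip()]

    # A seeded generator makes the shuffle reproducible between runs
    random.Random(seed).shuffle(kept)

    # newline="\n" keeps Unix line endings even when run on Windows
    with open(dst, "w", encoding="utf-8", newline="\n") as f:
        f.writelines(line + "\n" for line in kept)

    return len(kept)
```

For example, `clean_and_shuffle("kazakh_latin_corpus.json", "kazakh_latin_corpus_clean.json")` writes the cleaned copy to a separate file; the output filename here is only an illustration, so check the shell script for the name the later steps expect.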

---

## 📋 Өзгеріс журналы / Changelog

* **DalaT5 v1**: 13 сәуірде дәл реттелген, 13 сәуірде қолжетімді болды. Жаттығу үшін ~38 мың деректер жазбасы пайдаланылды. Дисперсиясы жоғары және үлгі сенімділігі төмен бастапқы нұсқа / Fine-tuned on April 13 and made available on the same day. Used ~38k data records for training. Initial version with high variance and low model confidence

* **DalaT5 v2**: 18 сәуірде дәл реттелген және сол күні қолжетімді болды. Жаттығу үшін ~1 миллион деректер жазбасы пайдаланылды. Деректердің көп болуының арқасында әлдеқайда жақсы өнімділікті көрсеткен екінші итерация / Fine-tuned on April 18 and made available on the same day. Used ~1 million data records for training. Second iteration that exhibited much better performance owing to more data availability

* **DalaT5 v3**: 20 сәуірде дәл реттелген және сол күні қолжетімді болды. Жаттығу үшін ~1,6 миллион деректер жазбасы пайдаланылды. Үшінші итерация одан әрі жақсартуларды, сондай-ақ белгілі бір дәрежеде семантикалық түсінуді көрсетті / Fine-tuned on April 20 and made available on the same day. Used ~1.6 million data records for training. Third iteration that showed further improvements, as well as some degree of semantic understanding

* **DalaT5 v4**: 23 сәуірде нақтыланған және сол күні қолжетімді болды. Жаттығу үшін ~1,9 миллион жазба (Wikipedia dump + CC100 + KazParC) пайдаланылды. Семантикалық түсініктің жоғарылауын көрсететін төртінші итерация / Fine-tuned on April 23 and made available on the same day. Used ~1.9 million records (Wikipedia dump + CC100 + KazParC) for training. Fourth iteration that showed increased semantic understanding

* **DalaT5 v5**: 25 сәуірде дәл реттелген және сол күні қолжетімді болды. Қазақ кириллица және латын графикасын жақсырақ өңдеу үшін өзінің жеке токенизаторы бар ~1,9 миллион жазба (v4 сияқты) пайдаланылды / Fine-tuned on April 25 and made available on the same day. Used ~1.9 million records (like v4) with its own tokeniser to better handle the Kazakh Cyrillic and Latin scripts 

  * **DalaT5 v5.1**: 25 сәуірде (v5 нұсқасынан кейін бірден) дәл реттелген және сол күні қолжетімді болды. Жақсырақ жалпылауды қамтамасыз ету үшін жаттығу үшін ~2,2 миллион жазба және токенизатор үшін 1 миллион жазба пайдаланылды. v5-пен салыстырғанда галлюцинациялар күрт төмендеп, семантикалық түсіну одан әрі жақсарды / Fine-tuned on April 25 (immediately after v5) and made available on the same day. Used ~2.2 million records for training and 1 million records for the tokeniser to ensure better generalisation. Hallucinations decreased drastically when compared to v5, and semantic understanding was further enhanced

  * **DalaT5 v5.2**: 26 сәуірде дәл реттелген және сол күні қолжетімді болды. Жалпы алғанда сол токенизатор құрылымы пайдаланылды, бірақ оқыту мен бағалау үшін ~2,4 миллион жазба қолданылды. Бұл нұсқамен бірге бағалау шығыны (evaluation loss) да қолжетімді болды. Жалпы, v5.1-ден өтіп кеткен галлюцинациялар толығымен дерлік жойылды / Fine-tuned on April 26 and made available on the same day. Used the same tokeniser structure overall, but leveraged ~2.4 million records for training and evaluation. The evaluation loss was also made available with this version. Overall, hallucinations that made it through v5.1 were almost completely eliminated

  * **DalaT5 v5.3**: 28 сәуірде дәл реттелген және сол күні қолжетімді болды. Жалпы, токенизаторға арналған ұлғайтылған максималды сөйлем өлшемінен басқа (4192 орнына 8384), ол v5.2 сияқты құрылымды пайдаланды. Бұл нұсқа одан да жақсы жалпылауды қамтамасыз ету үшін оқыту және бағалау үшін ~2,6 миллион жазбаны пайдаланды. Галлюцинацияның одан әрі азаюы байқалды және модель қазір қазақ морфологиясын өңдеуге шебер болған сияқты. / Fine-tuned on April 28 and made available on the same day. Overall, other than an increased maximum sentence size for the tokeniser (8384 instead of 4192), it used the same structure as v5.2. This version leveraged ~2.6 million records for training and evaluation to ensure even better generalisation. Further reduction of hallucinations was observed, and the model now seems to have become adept at handling Kazakh morphology

---

## 📚 Дәйексөз келтіру / Credits

Егер сіз DalaT5-ті зерттеуде немесе туынды жұмыстарда қолдансаңыз - біріншіден, рахмет. Екіншіден, егер қаласаңыз, дәйексөз келтіріңіз / If you use DalaT5 in research or derivative works - first off, thank you. Secondly, should you be willing, feel free to cite:

```bibtex
@misc{pereira_cruz_dalat5_2025,
  author = {Rodrigo Pereira Cruz},
  title = {DalaT5: Cyrillic-to-Latin Kazakh transliterator on fine-tuned T5},
  year = 2025,
  url = {https://huggingface.co/crossroderick/dalat5},
  doi = {10.57967/hf/5255},
  publisher = {Hugging Face}
}
```