Lutfiy: Southern Uzbek Machine Translation Model
This repository contains an initial machine translation model for the Southern Uzbek language, developed as part of the research paper "Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek".
Model details
Model | Tokenizer Length | Parameter Count |
---|---|---|
lutfiy | 256,204 | 615M |
Common attributes:
- Base Model: facebook/nllb-200-distilled-600M
- Languages: Southern Uzbek, Northern Uzbek, English
Intended uses & limitations
These models are designed for machine translation tasks involving the Southern Uzbek language. They can be used to translate between Southern Uzbek, Northern Uzbek, and English.
How to use
You can use these models with the Transformers library. Here's a quick example:
Install the lutfiy library for fixing ZWNJ:
pip install lutfiy
from lutfiy import fix_zwnj
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_ckpt = "tahrirchi/lutfiy"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)
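# Example: translate from Northern Uzbek (Latin script) to Southern Uzbek (Arabic script)
input_text = "O'zbekiston kelajagi buyuk davlatdir."
tokenizer.src_lang = "uzn_Latn"  # source language: Northern Uzbek
tokenizer.tgt_lang = "uzs_Arab"  # target language: Southern Uzbek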
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(fix_zwnj(translated_text)) # اۉزبېکستان کېلهجگی بویوک دولت دیر.
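The steps above can also be wrapped in a small reusable function. The following is a minimal sketch, not part of the original example: the function name `translate` and its parameters are hypothetical, and passing `forced_bos_token_id` to `generate` is an assumption carried over from how base NLLB checkpoints are usually prompted for a target language. Whether this checkpoint needs it is not stated in this card.

```python
def translate(text, tokenizer, model,
              src_lang="uzn_Latn", tgt_lang="uzs_Arab", max_new_tokens=128):
    """Translate one sentence with an NLLB-style tokenizer and model.

    `tokenizer` and `model` are the objects loaded with
    AutoTokenizer / AutoModelForSeq2SeqLM in the example above.
    """
    # Tell the tokenizer which language the input is written in.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt")

    # Look up the target language code in the vocabulary and fail loudly
    # if it is missing, instead of silently using the unknown token.
    tgt_id = tokenizer.convert_tokens_to_ids(tgt_lang)
    if tgt_id == tokenizer.unk_token_id:
        raise ValueError(f"Language code {tgt_lang!r} is not in the tokenizer vocabulary")

    # Assumption: force the first generated token to be the target language
    # code, as is customary for base NLLB checkpoints.
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tgt_id,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With the objects from the example above, `fix_zwnj(translate(input_text, tokenizer, model))` would then produce the post-processed translation.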
Training data
The models were trained on a parallel corpus of approximately 40,000 sentence pairs, including:
- Northern Uzbek - Southern Uzbek (37,415 pairs)
- English - Southern Uzbek (2,579 pairs)
The dataset is available here.
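As a quick check on the figures listed in this section, the snippet below sums the two subsets and computes the share of each. It uses only the pair counts stated above; the variable names are illustrative.

```python
# Pair counts as listed in the "Training data" section.
pair_counts = {
    "Northern Uzbek - Southern Uzbek": 37_415,
    "English - Southern Uzbek": 2_579,
}

total = sum(pair_counts.values())
shares = {name: count / total for name, count in pair_counts.items()}

print(f"Total listed pairs: {total:,}")  # Total listed pairs: 39,994
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The two listed subsets add up to 39,994 pairs, and the Northern Uzbek - Southern Uzbek subset makes up about 93.6% of them, so the model has seen far less English - Southern Uzbek data.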
Training procedure
For full details of the training procedure, please refer to our paper.
Citation
If you use these models in your research, please cite our paper:
@misc{mamasaidov2025fillinggapuzbekcreating,
title={Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek},
author={Mukhammadsaid Mamasaidov and Azizullah Aral and Abror Shopulatov and Mironshoh Inomjonov},
year={2025},
eprint={2508.14586},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.14586},
}
Contacts
We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Southern Uzbek.
For questions about further development or issues with the dataset, please contact us at [email protected] or [email protected].