Lutfiy: Southern Uzbek Machine Translation Model

This repository contains an initial machine translation model for the Southern Uzbek language, developed as part of the research paper "Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek".

Model details

Model	Tokenizer Length	Parameter Count
`lutfiy`	256,204	615M

Common attributes:

Base Model: nllb-200-600M
Languages: Southern Uzbek, Northern Uzbek, English

Intended uses & limitations

These models are designed for machine translation tasks involving the Southern Uzbek language. They can be used to translate between Southern Uzbek, Northern Uzbek, and English.

How to use

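You can use these models with the Transformers library. The quick example below also relies on the `lutfiy` package, whose `fix_zwnj` helper is applied to the decoded output to fix zero-width non-joiner (ZWNJ, U+200C) characters.

Install the `lutfiy` library for fixing ZWNJ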

pip install lutfiy

from lutfiy import fix_zwnj
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = "tahrirchi/lutfiy"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

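# Example translation: Northern Uzbek (Latin script) to Southern Uzbek (Arabic script)
input_text = "O'zbekiston kelajagi buyuk davlatdir."

# Set the source and target language codes on the tokenizer
tokenizer.src_lang = "uzn_Latn"
tokenizer.tgt_lang = "uzs_Arab"

# Tokenize the input, generate the translation, and decode it back to text
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)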
print(fix_zwnj(translated_text)) # اۉزبېکستان کېلهجگی بویوک دولت دیر.
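
If you translate in more than one direction, the steps above can be wrapped in a small helper. The following is a minimal sketch, not part of the `lutfiy` package: the `translate` function, the `LANG_CODES` table, and the `postprocess` parameter are hypothetical names introduced here for illustration. The code `eng_Latn` is an assumption based on the usual NLLB code for English; the example above only shows `uzn_Latn` and `uzs_Arab`. The helper repeats the same tokenizer and model calls as the example and changes nothing about how generation is configured.

```python
# Language names from the model card mapped to tokenizer language codes.
# "uzn_Latn" and "uzs_Arab" come from the card's example; "eng_Latn" is an
# assumption (the usual NLLB code for English) and is not shown in the card.
LANG_CODES = {
    "northern_uzbek": "uzn_Latn",
    "southern_uzbek": "uzs_Arab",
    "english": "eng_Latn",  # assumption
}


def translate(text, src, tgt, tokenizer, model, postprocess=None):
    """Translate `text` from `src` to `tgt` (keys of LANG_CODES).

    `tokenizer` and `model` are the objects loaded in the example above.
    `postprocess` is an optional callable applied to the decoded string,
    for instance `fix_zwnj`.
    """
    for name in (src, tgt):
        if name not in LANG_CODES:
            raise ValueError(f"unknown language {name!r}; expected one of {sorted(LANG_CODES)}")
    if src == tgt:
        raise ValueError("source and target languages must differ")

    # Same sequence of calls as in the card's example.
    tokenizer.src_lang = LANG_CODES[src]
    tokenizer.tgt_lang = LANG_CODES[tgt]
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return postprocess(decoded) if postprocess is not None else decoded
```

With the objects from the example, a call could look like `translate(input_text, "northern_uzbek", "southern_uzbek", tokenizer, model, postprocess=fix_zwnj)`. Validating the language names up front avoids silently passing a misspelled code to the tokenizer.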

Training data

The models were trained on a parallel corpus of roughly 40,000 sentence pairs (39,994 in total), comprising:

Northern Uzbek - Southern Uzbek (37,415 pairs)
English - Southern Uzbek (2,579 pairs)

The dataset is available here.
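
To make the composition of the corpus concrete, the short sketch below sums the two pair counts listed above and reports each language pair's share of the total. The names `PAIR_COUNTS` and `corpus_summary` are illustrative only; the numbers are the ones stated in this section.

```python
# Sentence-pair counts as stated in the "Training data" section.
PAIR_COUNTS = {
    "Northern Uzbek - Southern Uzbek": 37_415,
    "English - Southern Uzbek": 2_579,
}


def corpus_summary(pair_counts):
    """Return the total number of pairs and each language pair's share of it."""
    total = sum(pair_counts.values())
    shares = {name: count / total for name, count in pair_counts.items()}
    return total, shares


total, shares = corpus_summary(PAIR_COUNTS)
print(f"total pairs: {total}")  # total pairs: 39994
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The corpus is dominated by Northern Uzbek - Southern Uzbek pairs (about 93.6%), with English - Southern Uzbek making up the remaining 6.4% or so.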

Training procedure

For full details of the training procedure, please refer to our paper.

Citation

If you use these models in your research, please cite our paper:

@misc{mamasaidov2025fillinggapuzbekcreating,
      title={Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek}, 
      author={Mukhammadsaid Mamasaidov and Azizullah Aral and Abror Shopulatov and Mironshoh Inomjonov},
      year={2025},
      eprint={2508.14586},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14586}, 
}

Contacts

We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Southern Uzbek.

For questions about further development or issues with the dataset, please contact [email protected] or [email protected].
