Lutfiy: Southern Uzbek Machine Translation Model

This repository contains an initial machine translation model for the Southern Uzbek language, developed as part of the research paper "Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek".

Model details

Model     Tokenizer Length    Parameter Count
lutfiy    256,204             615M

Common attributes:

  • Base Model: nllb-200-600M
  • Languages: Southern Uzbek, Northern Uzbek, English
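
The parameter count listed above gives a rough lower bound on the memory needed just to hold the weights. The following is a back-of-the-envelope sketch, assuming 4 bytes per parameter for float32 and 2 bytes for float16, and ignoring activations and framework overhead. The helper name `weight_memory_gb` is hypothetical, and "615M" is treated as exactly 615,000,000 for illustration.

```python
# Rounded parameter count taken from the table above ("615M").
PARAM_COUNT = 615_000_000

# Assumed storage size per parameter for two common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(param_count: int, dtype: str) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(PARAM_COUNT, dtype):.2f} GB")
```

Under these assumptions the weights take roughly 2.46 GB in float32 and about half that in float16.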

Intended uses & limitations

This model is designed for machine translation tasks involving the Southern Uzbek language. It can be used to translate between Southern Uzbek, Northern Uzbek, and English.

How to use

You can use this model with the Transformers library. First, install the lutfiy library, which provides a helper for fixing ZWNJ (zero-width non-joiner) characters:

pip install lutfiy

Then run a quick example:

from lutfiy import fix_zwnj
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = "tahrirchi/lutfiy"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

# Example translation
input_text = "O'zbekiston kelajagi buyuk davlatdir."

tokenizer.src_lang = "uzn_Latn"
tokenizer.tgt_lang = "uzs_Arab"

inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(fix_zwnj(translated_text)) # اۉزبېکستان کېلهجگی بویوک دولت دیر.
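
To translate in other directions, the example above can be wrapped in a small helper. This is a minimal sketch, not the authors' reference code: the function name `translate` and the `max_new_tokens` default are hypothetical. Passing `forced_bos_token_id` is the usual way to select the target language for NLLB-based checkpoints in Transformers; whether this particular checkpoint requires it is an assumption. Only the codes `uzn_Latn` and `uzs_Arab` appear in this card; any other code (for example NLLB's `eng_Latn` for English) is likewise an assumption.

```python
def translate(text, tokenizer, model, src_lang, tgt_lang, max_new_tokens=128):
    """Translate `text` from `src_lang` to `tgt_lang` using NLLB-style language codes.

    `tokenizer` and `model` are expected to be loaded as in the example above.
    """
    # Tell the tokenizer which language tag to attach to the source sentence.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt")

    # Force the first generated token to be the target-language tag
    # (standard NLLB usage; assumed to apply to this checkpoint as well).
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With `tokenizer`, `model`, and `fix_zwnj` set up as in the example above, a call could look like `fix_zwnj(translate(input_text, tokenizer, model, "uzn_Latn", "uzs_Arab"))`.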

Training data

The model was trained on a parallel corpus of roughly 40,000 sentence pairs, including:

  • Northern Uzbek - Southern Uzbek (37,415 pairs)
  • English - Southern Uzbek (2,579 pairs)
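
As a quick sanity check on the figures above, the two listed subsets can be summed and expressed as shares of the total. This sketch uses only the counts given in this card; the variable names are illustrative.

```python
# Pair counts exactly as listed in this card.
pairs = {
    "Northern Uzbek - Southern Uzbek": 37_415,
    "English - Southern Uzbek": 2_579,
}

total = sum(pairs.values())
shares = {name: count / total for name, count in pairs.items()}

print(f"Total listed pairs: {total:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The two subsets add up to 39,994 pairs, and the Northern Uzbek - Southern Uzbek portion makes up about 93.6% of them, so the English - Southern Uzbek direction has far less training data.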

The dataset is available here.

Training procedure

For full details of the training procedure, please refer to our paper.

Citation

If you use this model in your research, please cite our paper:

@misc{mamasaidov2025fillinggapuzbekcreating,
      title={Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek}, 
      author={Mukhammadsaid Mamasaidov and Azizullah Aral and Abror Shopulatov and Mironshoh Inomjonov},
      year={2025},
      eprint={2508.14586},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14586}, 
}

Contacts

We believe this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Southern Uzbek.

For further development or issues with the dataset, please contact [email protected] or [email protected].
