arxiv:2508.14586

Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek

Published on Aug 20

Abstract

AI-generated summary: New resources and a fine-tuned model for Southern Uzbek machine translation are presented, including datasets and a post-processing method to improve morphological handling.

Southern Uzbek (uzs) is a Turkic language variety spoken by around 5 million people in Afghanistan and differs significantly from Northern Uzbek (uzn) in phonology, lexicon, and orthography. Despite the large number of speakers, Southern Uzbek is underrepresented in natural language processing. We present new resources for Southern Uzbek machine translation, including a 997-sentence FLORES+ dev set, 39,994 parallel sentences from dictionary, literary, and web sources, and a fine-tuned NLLB-200 model (lutfiy). We also propose a post-processing method for restoring Arabic-script half-space characters, which improves handling of morphological boundaries. All datasets, models, and tools are released publicly to support future work on Southern Uzbek and other low-resource languages.
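The abstract does not describe how the half-space restoration works, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes the half-space is the Unicode ZERO WIDTH NON-JOINER (U+200C) and that a list of reference tokens containing it is available. The function names `build_restoration_table` and `restore_half_spaces` are hypothetical, and the sample tokens are Latin placeholders rather than real Southern Uzbek words.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, assumed here to be the "half-space"


def build_restoration_table(reference_tokens):
    """Map ZWNJ-free spellings to the ZWNJ-bearing forms seen in reference text.

    Hypothetical helper: assumes reference_tokens is an iterable of strings
    taken from text where half-spaces are written correctly.
    """
    table = {}
    for token in reference_tokens:
        if ZWNJ in token:
            table[token.replace(ZWNJ, "")] = token
    return table


def restore_half_spaces(text, table):
    """Re-insert half-spaces into whitespace-tokenized model output.

    Handles two simple failure modes: the half-space was turned into a
    full space (two tokens), or it was dropped (one fused token).
    """
    tokens = text.split()
    restored = []
    i = 0
    while i < len(tokens):
        # Case 1: half-space replaced by a full space, so try joining a pair.
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in table:
            restored.append(table[tokens[i] + tokens[i + 1]])
            i += 2
            continue
        # Case 2: half-space dropped; look the single token up directly.
        restored.append(table.get(tokens[i], tokens[i]))
        i += 1
    return " ".join(restored)


# Placeholder tokens for illustration only.
table = build_restoration_table(["ab" + ZWNJ + "cd", "xy"])
fixed = restore_half_spaces("ab cd xy", table)
```

A lookup table like this only restores forms already seen in the reference text; it cannot generalize to unseen stem and suffix combinations, which a rule-based or learned method would be needed for.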
