arxiv:2508.08680

TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation

Published on Aug 12
· Submitted by ArmelRandy on Aug 13

Abstract

TopXGen uses LLMs to generate high-quality, topic-diverse target-side texts in LRLs, which can be backtranslated to produce parallel data that improves translation performance in ICL and fine-tuning.

AI-generated summary

LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help; however, the improvements they provide are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most common of which is backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good-quality, relevant target-side texts, which are not readily available for many LRLs. In this paper, we present TopXGen, an LLM-based approach for the generation of high-quality, topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good-quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that TopXGen boosts LLM translation performance during fine-tuning and in-context learning. Code and outputs are available at https://github.com/ArmelRandy/topxgen.
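As a rough illustration of the pipeline described above, the sketch below generates topic-conditioned target-side text in an LRL and backtranslates it into a high-resource source language to form synthetic parallel pairs. The `call_llm` interface, the topic list, and the prompt wording are illustrative assumptions rather than the paper's exact prompts; the actual implementation is in the linked repository.

```python
# Minimal sketch of a TopXGen-style pipeline, per the abstract:
# (1) an LLM generates topic-diverse target-side text in a low-resource language (LRL),
# (2) the text is backtranslated into a high-resource source language,
# (3) the (source, target) pairs are kept as synthetic parallel data for
#     fine-tuning or in-context learning.
# NOTE: call_llm, the topics, and the prompts here are illustrative assumptions.

from typing import Callable

def topxgen_style_pipeline(
    call_llm: Callable[[str], str],   # any text-in/text-out LLM interface
    topics: list[str],                # seed topics to enforce topical diversity
    target_lang: str = "Hausa",       # example low-resource target language
    source_lang: str = "English",     # high-resource source language
) -> list[dict[str, str]]:
    parallel_data = []
    for topic in topics:
        # Step 1: generate a natural target-side paragraph in the LRL about the topic.
        gen_prompt = f"Write a short, natural paragraph in {target_lang} about: {topic}."
        target_text = call_llm(gen_prompt)

        # Step 2: backtranslate into the high-resource source language,
        # a direction LLMs handle well according to the paper's intuition.
        bt_prompt = (
            f"Translate the following {target_lang} text into {source_lang}:\n{target_text}"
        )
        source_text = call_llm(bt_prompt)

        # Step 3: keep the pair as synthetic parallel data.
        parallel_data.append({"source": source_text, "target": target_text})
    return parallel_data

if __name__ == "__main__":
    # Stub LLM so the sketch runs end to end; replace with a real model call.
    dummy_llm = lambda prompt: f"<LLM output for: {prompt[:40]}...>"
    pairs = topxgen_style_pipeline(dummy_llm, topics=["farming", "weather", "markets"])
    print(pairs[0])
```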

Community

Paper submitter

We introduce TopXGen, a pipeline for generating high-quality, topic-diverse synthetic data for low-resource languages using LLMs. While LLMs often struggle to translate correctly into LRLs, their multilingual capabilities allow them to produce decent, natural-sounding text in these languages, which can then be backtranslated into a high-resource language to create parallel datasets. Unlike traditional backtranslation, TopXGen does not require large existing corpora in the target language. We demonstrate that TopXGen improves MT performance in both supervised fine-tuning and in-context learning settings.

Code: https://github.com/ArmelRandy/topxgen
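For the in-context learning setting mentioned above, the sketch below shows one way the synthetic pairs could be selected as few-shot demonstrations via similarity search: the most similar source sides are placed in the prompt before the sentence to translate. The token-overlap similarity and the prompt format are illustrative choices, not necessarily the paper's exact selection method.

```python
# Hedged sketch: similarity-based example selection over synthetic parallel pairs
# for an ICL translation prompt. Jaccard token overlap stands in for whatever
# similarity search the authors actually use.

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def build_icl_prompt(query: str, pairs: list[dict[str, str]], k: int = 3) -> str:
    # Rank synthetic pairs by source-side similarity to the query sentence.
    ranked = sorted(pairs, key=lambda p: jaccard(query, p["source"]), reverse=True)
    lines = [f"{p['source']} => {p['target']}" for p in ranked[:k]]
    lines.append(f"{query} =>")
    return "\n".join(lines)

if __name__ == "__main__":
    pairs = [
        {"source": "The farmers planted maize.", "target": "<LRL translation>"},
        {"source": "It rained heavily yesterday.", "target": "<LRL translation>"},
        {"source": "The market opens early.", "target": "<LRL translation>"},
    ]
    print(build_icl_prompt("The farmers sold maize at the market.", pairs, k=2))
```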
