TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation
Abstract
TopXGen uses LLMs to generate high-quality, topic-diverse target-side texts for LRLs, which can be backtranslated to improve translation performance in ICL and fine-tuning.
LLMs have been shown to perform well in machine translation (MT) with in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help, but the improvements they bring are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, most often through backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good-quality and relevant target-side texts, which are not readily available for many LRLs. In this paper, we present TopXGen, an LLM-based approach for generating high-quality, topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good-quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that TopXGen boosts LLM translation performance during fine-tuning and in-context learning. Code and outputs are available at https://github.com/ArmelRandy/topxgen.
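The two-step recipe described in the abstract (topic-conditioned generation in the LRL, then backtranslation into the HRL) can be summarized in a short sketch. The snippet below is an illustrative reconstruction, not the authors' released code: the model name (Qwen/Qwen2.5-7B-Instruct), the topic list, the prompts, and the Hausa/English language pair are all assumptions made for the example; the actual TopXGen setup may differ.

```python
# Minimal sketch of a TopXGen-style pipeline (illustrative, not the authors' code):
# 1) prompt an instruction-tuned LLM to write topic-diverse paragraphs in the
#    low-resource target language,
# 2) backtranslate them into the high-resource source language to obtain
#    synthetic parallel pairs for ICL or fine-tuning.
from transformers import pipeline

# Assumed model choice; any capable multilingual instruction-tuned LLM could be used.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

TOPICS = ["agriculture", "health", "sports", "technology"]  # assumed topic seeds
TARGET_LANG = "Hausa"    # example low-resource target language
SOURCE_LANG = "English"  # example high-resource source language

def generate_target_text(topic: str) -> str:
    """Ask the LLM to write a short paragraph in the LRL about a given topic."""
    messages = [{"role": "user",
                 "content": f"Write a short paragraph in {TARGET_LANG} about {topic}."}]
    out = generator(messages, max_new_tokens=200, do_sample=True, temperature=0.9)
    return out[0]["generated_text"][-1]["content"]

def backtranslate(text: str) -> str:
    """Backtranslate the LRL paragraph into the HRL source language."""
    messages = [{"role": "user",
                 "content": f"Translate the following {TARGET_LANG} text into {SOURCE_LANG}:\n{text}"}]
    out = generator(messages, max_new_tokens=200, do_sample=False)
    return out[0]["generated_text"][-1]["content"]

# Build synthetic (source, target) pairs.
parallel_data = []
for topic in TOPICS:
    target = generate_target_text(topic)
    source = backtranslate(target)
    parallel_data.append({"source": source, "target": target})
```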
Community
We introduce TopXGen, a pipeline for generating high-quality, topic-diverse synthetic data for low-resource languages using LLMs. While LLMs often struggle to translate accurately into LRLs, their multilingual capabilities allow them to produce decent, natural-sounding text in these languages, which can then be backtranslated into a high-resource language to create parallel datasets. Unlike traditional backtranslation, TopXGen does not require large existing corpora in the target language. We demonstrate that TopXGen improves MT performance in both supervised fine-tuning and in-context learning settings.
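To complement the pipeline sketch above, here is a hedged sketch of how such synthetic pairs could be used as in-context examples via similarity-based selection, which the abstract mentions as a standard ingredient. The embedding model, prompt format, and language pair below are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch: retrieve the synthetic pairs whose source side is most
# similar to the input sentence and build a few-shot translation prompt.
from sentence_transformers import SentenceTransformer, util

# Assumed retrieval model; any sentence embedding model would work similarly.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def build_icl_prompt(sentence: str, parallel_data: list[dict], k: int = 5) -> str:
    """Select the k most similar synthetic pairs and format a few-shot prompt."""
    sources = [pair["source"] for pair in parallel_data]
    scores = util.cos_sim(embedder.encode([sentence]), embedder.encode(sources))[0]
    top_k = scores.topk(k).indices.tolist()
    demos = "\n\n".join(
        f"English: {parallel_data[i]['source']}\nHausa: {parallel_data[i]['target']}"
        for i in top_k
    )
    return f"{demos}\n\nEnglish: {sentence}\nHausa:"
```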
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation (2025)
- Aligning Large Language Models to Low-Resource Languages through LLM-Based Selective Translation: A Systematic Study (2025)
- Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters (2025)
- Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models (2025)
- ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models (2025)
- Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs (2025)
- Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning (2025)