MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language
Abstract
MUG-Eval assesses LLMs' multilingual generation by transforming benchmarks into conversational tasks, offering a language-independent method that requires no language-specific NLP tools and correlates well with established benchmarks.
Evaluating the text generation capabilities of large language models (LLMs) is challenging, particularly for low-resource languages, where methods for direct assessment are scarce. We propose MUG-Eval, a novel framework that evaluates LLMs' multilingual generation capabilities by transforming existing benchmarks into conversational tasks and measuring the LLMs' accuracy on those tasks. These conversational tasks are specifically designed to require effective communication in the target language, so we can use the task success rate as a proxy for successful conversational generation. Our approach offers two key advantages: it is independent of language-specific NLP tools or annotated datasets, which are limited for most languages, and it does not rely on LLMs-as-judges, whose evaluation quality degrades outside a few high-resource languages. We evaluate 8 LLMs across 30 languages spanning high-, mid-, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks (r > 0.75) while enabling standardized comparisons across languages and models. Our framework provides a robust and resource-efficient solution for evaluating multilingual generation that can be extended to thousands of languages.
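To make the proxy concrete, here is a minimal sketch of one such conversational task in the style of the paper's Easy Twenty Questions setting: two instances of the same model converse in the target language, and the task success rate serves as the generation score. The prompts, turn limit, candidate-list setup, and the `chat` callback are illustrative assumptions, not the authors' exact protocol.

```python
"""Minimal sketch of a MUG-Eval-style proxy evaluation loop (simplified;
prompts, turn limits, and scoring details are illustrative)."""

from typing import Callable

# A chat function: takes a list of {"role": ..., "content": ...} messages and
# returns the model's reply. Hypothetical placeholder -- plug in any LLM backend.
ChatFn = Callable[[list[dict]], str]


def twenty_questions_trial(chat: ChatFn, secret: str, candidates: list[str],
                           language: str, max_turns: int = 5) -> bool:
    """One self-communication trial: the same model plays both the 'answerer'
    (knows the secret word) and the 'guesser' (must identify it), conversing
    only in the target language. Returns True if the final guess matches."""
    answerer_sys = (f"Answer yes/no questions about the secret word '{secret}'. "
                    f"Respond only in {language}.")
    guesser_sys = (f"Identify the secret word from {candidates} by asking "
                   f"yes/no questions. Ask and answer only in {language}.")

    answerer_msgs = [{"role": "system", "content": answerer_sys}]
    guesser_msgs = [{"role": "system", "content": guesser_sys}]

    for _ in range(max_turns):
        # Guesser asks a question in the target language.
        question = chat(guesser_msgs + [{"role": "user", "content": "Ask your next question."}])
        guesser_msgs.append({"role": "assistant", "content": question})
        # Answerer replies in the target language.
        answer = chat(answerer_msgs + [{"role": "user", "content": question}])
        guesser_msgs.append({"role": "user", "content": answer})

    final = chat(guesser_msgs + [{"role": "user", "content": "State your final guess only."}])
    return secret.lower() in final.lower()


def success_rate(chat: ChatFn, items: list[tuple[str, list[str]]], language: str) -> float:
    """Task success rate over all items -- the proxy score for generation quality."""
    wins = sum(twenty_questions_trial(chat, secret, cands, language) for secret, cands in items)
    return wins / len(items)
```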
Community
MUG-Eval tackles a critical challenge in multilingual LLM evaluation: how to fairly assess generation capabilities across languages without relying on scarce reference texts or biased LLM judges. We introduce a clever framework built on self-communication tasks (Easy Twenty Questions, MCQ Conversation, and Code Reconstruction) in which LLMs must communicate effectively in the target language to complete the task. The approach is remarkably resource-efficient, requiring no human annotations or language-specific tools, yet it correlates strongly with established benchmarks (r > 0.75). Evaluated on 8 LLMs across 30 languages, MUG-Eval reveals performance patterns across resource categories while offering better discriminative power than existing benchmarks. Perhaps most impressively, the framework is potentially scalable to over 2,000 languages via GlotLID (see the language-verification sketch below), offering a truly language-agnostic solution for multilingual evaluation that could significantly advance equitable assessment of LLMs across the world's languages.
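For the language check that makes the framework extensible to thousands of languages, a response can be verified with GlotLID, the open fastText-based language identifier covering roughly 2,000 languages. The sketch below follows the public GlotLID model card (repo `cis-lmu/glotlid`, file `model.bin`); the confidence threshold and helper name are illustrative assumptions, not the paper's exact configuration.

```python
"""Sketch of target-language verification with GlotLID.
Requires: pip install fasttext huggingface_hub"""

import fasttext
from huggingface_hub import hf_hub_download

# Download and load the GlotLID fastText model from the Hugging Face Hub.
model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
lid_model = fasttext.load_model(model_path)


def in_target_language(text: str, target: str, threshold: float = 0.5) -> bool:
    """Return True if GlotLID identifies `text` as `target` (e.g. 'kor_Hang')
    with confidence above `threshold`. fastText cannot predict on newlines,
    so they are replaced with spaces first."""
    labels, probs = lid_model.predict(text.replace("\n", " "))
    predicted = labels[0].replace("__label__", "")
    return predicted == target and float(probs[0]) >= threshold


# Example: a response that drifts out of Korean would fail this check.
# in_target_language("안녕하세요, 만나서 반갑습니다.", "kor_Hang")
```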
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models (2025)
- Evaluating Compact LLMs for Zero-Shot Iberian Language Tasks on End-User Devices (2025)
- IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation (2025)
- Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks (2025)
- How Reliable is Multilingual LLM-as-a-Judge? (2025)
- Is LLM the Silver Bullet to Low-Resource Languages Machine Translation? (2025)
- Mind the Language Gap: Automated and Augmented Evaluation of Bias in LLMs for High- and Low-Resource Languages (2025)