arxiv:2505.14395

MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language

Published on May 20 · Submitted by seyoungsong on May 23
Abstract

AI-generated summary: MUG-Eval assesses LLMs' multilingual generation by transforming benchmarks into conversational tasks, offering a language-independent, NLP-tool-free method that correlates well with established benchmarks.

Evaluating the text generation capabilities of large language models (LLMs) is challenging, particularly for low-resource languages, where methods for direct assessment are scarce. We propose MUG-Eval, a novel framework that evaluates LLMs' multilingual generation capabilities by transforming existing benchmarks into conversational tasks and measuring the LLMs' accuracy on those tasks. These conversational tasks are designed to require effective communication in the target language, so task success rate serves as a proxy for successful conversation generation. Our approach offers two key advantages: it is independent of language-specific NLP tools or annotated datasets, which are limited for most languages, and it does not rely on LLMs as judges, whose evaluation quality degrades outside a few high-resource languages. We evaluate 8 LLMs across 30 languages spanning high-, mid-, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks (r > 0.75) while enabling standardized comparisons across languages and models. Our framework provides a robust and resource-efficient solution for evaluating multilingual generation that can be extended to thousands of languages.
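
To make the proxy metric concrete, here is a minimal sketch, assuming made-up data, of how a per-language task success rate could be computed and correlated (Pearson r) with an established benchmark. It is not the authors' code; the outcome lists, language codes, and reference scores are placeholders, not results from the paper.

```python
# Minimal sketch (not the authors' code) of a MUG-Eval-style proxy metric:
# per-language task success rate, correlated with an established benchmark.
# All outcomes, language codes, and reference scores below are placeholders.
from statistics import mean, correlation  # correlation requires Python 3.10+

# Hypothetical conversation outcomes per language (True = task completed).
outcomes = {
    "deu": [True, True, False, True],
    "swh": [True, False, False, True],
    "tha": [False, False, True, True],
}

# Proxy score: fraction of conversations in which the task succeeded.
proxy_scores = {lang: mean(results) for lang, results in outcomes.items()}

# Hypothetical scores from an established reference benchmark.
reference_scores = {"deu": 0.81, "swh": 0.52, "tha": 0.47}

langs = sorted(outcomes)
r = correlation([proxy_scores[l] for l in langs],
                [reference_scores[l] for l in langs])
print(f"proxy scores: {proxy_scores}")
print(f"Pearson r vs. reference benchmark: {r:.2f}")
```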

Community

Paper author and submitter

MUG-Eval tackles a critical challenge in multilingual LLM evaluation: how to fairly assess generation capabilities across languages without relying on scarce reference texts or biased LLM judges. We introduce a framework built on self-communication tasks (Easy Twenty Questions, MCQ Conversation, and Code Reconstruction) in which LLMs must communicate effectively in the target language to complete the task. The approach is remarkably resource-efficient, requiring no human annotations or language-specific tools, yet it correlates strongly with established benchmarks (r > 0.75). Evaluated on 8 LLMs across 30 languages, MUG-Eval reveals performance patterns across resource categories while offering better discriminative power than existing benchmarks. Perhaps most impressively, the framework is potentially scalable to over 2,000 languages via GlotLID, offering a truly language-agnostic solution for multilingual evaluation that could significantly advance equitable assessment of LLMs across the world's languages.
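
Below is a minimal, hypothetical sketch of how one such self-communication task (in the spirit of Easy Twenty Questions) could be wired up and scored by task success rate. It is not the authors' implementation: `call_llm` is a stand-in for a real chat-model call, and the prompts and word list are made up.

```python
# Sketch of a self-communication task in the spirit of "Easy Twenty Questions":
# one model instance asks yes/no questions in the target language, another
# answers about a secret word, and binary task success is the proxy signal.
import random
from typing import Callable

def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-model call; returns canned replies so the
    sketch runs end to end."""
    return random.choice(["yes", "no", "guess: apple"])

def twenty_questions(secret: str,
                     questioner: Callable[[str], str],
                     answerer: Callable[[str], str],
                     max_turns: int = 20) -> bool:
    """Return True if the questioner names the secret word within max_turns."""
    history = ""
    for _ in range(max_turns):
        q = questioner(
            "You are playing twenty questions in the target language.\n"
            f"Conversation so far:\n{history}"
            "Ask one yes/no question, or finish with 'guess: <word>'."
        )
        if q.lower().startswith("guess:"):
            return secret.lower() in q.lower()
        a = answerer(
            f"The secret word is '{secret}'. "
            f"Answer only 'yes' or 'no': {q}"
        )
        history += f"Q: {q}\nA: {a}\n"
    return False

if __name__ == "__main__":
    words = ["apple", "bridge", "piano"]  # placeholder evaluation items
    wins = sum(twenty_questions(w, call_llm, call_llm) for w in words)
    print(f"task success rate: {wins / len(words):.2f}")
```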

