MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation
Abstract
MMMG is a comprehensive benchmark for multimodal generation, offering 49 tasks and 937 instructions to align automatic evaluation with human judgment, revealing areas for improvement in reasoning and audio generation.
Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.
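To make the agreement figure concrete, here is a minimal, hypothetical sketch (not the authors' code) of how instruction-level agreement between automatic verdicts and human judgments could be averaged over tasks; the record format and field names are assumptions.

```python
# Minimal sketch (assumption: not the authors' code) of agreement between the
# automatic pipeline and human judgments, averaged over tasks.
# Field names "task", "auto_pass", "human_pass" are hypothetical.
from collections import defaultdict

def average_agreement(records) -> float:
    per_task = defaultdict(lambda: [0, 0])  # task -> [matching verdicts, total]
    for r in records:
        per_task[r["task"]][0] += int(r["auto_pass"] == r["human_pass"])
        per_task[r["task"]][1] += 1
    rates = [matches / total for matches, total in per_task.values()]
    return 100.0 * sum(rates) / len(rates)

# Toy example: perfect agreement on task A, 50% on task B -> 75.0
demo = [
    {"task": "A", "auto_pass": True,  "human_pass": True},
    {"task": "A", "auto_pass": False, "human_pass": False},
    {"task": "B", "auto_pass": True,  "human_pass": False},
    {"task": "B", "auto_pass": True,  "human_pass": True},
]
print(average_agreement(demo))  # 75.0
```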
Community
A comprehensive and reliable benchmark for multimodal generation (image, audio, interleaved text and image, interleaved text and audio)
✅ Each task has a carefully crafted automatic evaluation pipeline to ensure reliability
✅ Much more aligned with humans than other benchmarks
✅ Comprehensive: 4 modality combinations, 49 tasks, 937 instructions
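The first point above refers to a per-task "models plus programs" evaluation pipeline. Below is a minimal hypothetical sketch of what such a hybrid check can look like; the checks, field names, and thresholds are stand-ins, not the MMMG implementation.

```python
# Minimal hypothetical sketch of a hybrid "program + model" check for one task.
# A deterministic program verifies a structural constraint first, then a judge
# model verifies a semantic one; the instance passes only if both succeed.
from typing import Any, Callable, Dict

def hybrid_check(output: Dict[str, Any],
                 program_check: Callable[[Dict[str, Any]], bool],
                 judge_model: Callable[[Dict[str, Any]], bool]) -> bool:
    if not program_check(output):   # cheap, deterministic check first
        return False
    return judge_model(output)      # e.g., a vision-language model as judge

# Stand-in example: an interleaved text-image output must contain exactly
# 3 images (program check) and stay on the requested topic (judge model score).
output = {"num_images": 3, "topic_score": 0.91}
print(hybrid_check(
    output,
    program_check=lambda o: o["num_images"] == 3,
    judge_model=lambda o: o["topic_score"] > 0.5,
))  # True
```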
The following similar papers were recommended by the Librarian Bot via the Semantic Scholar API:
- LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs (2025)
- Preliminary Explorations with GPT-4o(mni) Native Image Generation (2025)
- Emerging Properties in Unified Multimodal Pretraining (2025)
- FRABench and GenEval: Scaling Fine-Grained Aspect Evaluation across Tasks, Modalities (2025)
- OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning (2025)
- OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks (2025)
- MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models (2025)