arxiv:2503.22952

OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Published on Mar 29

Submitted by ColorfulAI on Apr 2

Authors: Yuxuan Wang, Zilong Zheng

Abstract

The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models (OmniLLMs), designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions across six distinct subtasks, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.

Community

ColorfulAI (paper author, paper submitter)

🚀 [CVPR 2025] OmniMMI

🌟 Benchmark Core Capabilities of Omni Models:
▫️ Streaming Understanding: Connects past, present, and future
▫️ Interaction Modeling: Speaker recognition × proactive responses × turn taking

🔥 Open-Source Suite:
❶ OmniMMI: The benchmark for evaluating Omni models
🔗 https://omnimmi.github.io/
❷ M4 Framework: Enables MLLM duplex interruption & proactive output with just $4 worth of data
🔗 https://github.com/patrick-tssn/M4
❸ Open-Omni-Nexus: A unified framework for training LLMs, LVLMs, and LALMs into Omni models
▸ Supports multi-scale Qwen/Llama
▸ Plug-and-play open-source datasets
🔗 https://github.com/patrick-tssn/Open-Omni-Nexus

✨ Full-stack evolution in listening, speaking, reading, and writing. Discussion is welcome!