MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Abstract
MCIF is a multilingual, human-annotated benchmark for evaluating instruction-following in crosslingual, multimodal settings using scientific talks.
Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems into general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short of evaluating these dimensions jointly: they are often limited to English, focus mostly on a single modality at a time, rely on short-form contexts, or lack human annotations, hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual, human-annotated benchmark based on scientific talks designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities (speech, vision, and text) and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs' ability to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLM development.
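The abstract describes the benchmark's structure: instructions in four languages crossed with speech, vision, and text contexts drawn from scientific talks. The sketch below shows how an evaluation loop over such a benchmark might be wired up with the Hugging Face `datasets` library. The dataset ID, configuration names, and field names are illustrative assumptions, not MCIF's official schema, and the model call is a stand-in.

```python
# Minimal sketch of a crosslingual instruction-following evaluation loop over
# an MCIF-style benchmark. The dataset ID ("org/mcif"), per-language configs,
# and field names below are hypothetical placeholders, not the official schema.
from datasets import load_dataset

LANGUAGES = ["en", "de", "it", "zh"]  # English, German, Italian, Chinese


def dummy_mllm(instruction: str, context: dict) -> str:
    """Stand-in for a multimodal LLM call over text/speech/vision context."""
    return "model output"


def evaluate(dataset_id: str = "org/mcif") -> dict:  # hypothetical Hub ID
    predictions = {}
    for lang in LANGUAGES:
        # Hypothetical layout: one configuration per instruction language.
        data = load_dataset(dataset_id, name=lang, split="test")
        for example in data:
            # Hypothetical fields: "instruction" is written in `lang`, while
            # the talk context (transcript, audio, video) stays in English.
            output = dummy_mllm(
                instruction=example["instruction"],
                context={
                    "text": example.get("transcript"),
                    "audio": example.get("audio"),
                    "video": example.get("video"),
                },
            )
            predictions.setdefault(lang, []).append(output)
    return predictions


if __name__ == "__main__":
    print({lang: len(outputs) for lang, outputs in evaluate().items()})
```

Collecting predictions per instruction language, as above, makes it straightforward to report the crosslingual breakdown the benchmark is designed for, rather than a single aggregate score.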
Community
The following similar papers were recommended by the Semantic Scholar API:
- MultiVox: Benchmarking Voice Assistants for Multimodal Interactions (2025)
- MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation (2025)
- OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs (2025)
- A Culturally-diverse Multilingual Multimodal Video Benchmark & Model (2025)
- NeoBabel: A Multilingual Open Tower for Visual Generation (2025)
- POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering (2025)
- ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark (2025)