VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Abstract
VideoEval-Pro, a benchmark built on open-ended questions, provides a more accurate measure of long video understanding than existing multiple-choice benchmarks.
Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a rather sober lesson about existing LVU benchmarks. First, most of them rely heavily on multiple-choice questions (MCQs), whose scores are inflated by the possibility of guessing the correct answer. Second, a significant portion of their questions have strong priors that allow models to answer directly without even watching the input video; on Video-MME, for example, Gemini-1.5-Pro achieves over 50% accuracy when given only a single random frame from a long video. We also observe that increasing the number of input frames does not necessarily improve performance on existing benchmarks, which is counterintuitive. As a result, the validity and robustness of current LVU benchmarks are undermined, impeding a faithful assessment of LMMs' long-video understanding capability. To tackle this problem, we propose VideoEval-Pro, a realistic LVU benchmark with open-ended short-answer questions that truly require understanding the entire video. VideoEval-Pro assesses both segment-level and full-video understanding through perception and reasoning tasks. By evaluating 21 proprietary and open-source video LMMs, we reach the following findings: (1) video LMMs show drastic performance drops (>25%) on open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do not lead to higher open-ended scores on VideoEval-Pro; (3) compared to other MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input frames. Our results show that VideoEval-Pro offers a more realistic and reliable measure of long video understanding, providing a clearer view of progress in this domain.
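To make the evaluation setup described in the abstract concrete, the sketch below shows the two mechanics it relies on: sampling a fixed number of uniformly spaced frames from a long video and scoring a free-form short answer against a reference. This is a minimal illustration, not the paper's actual pipeline; the function names, the OpenCV-based frame decoding, the toy string-containment judge, and the placeholder names in the trailing usage comments (e.g. `your_video_lmm`, the example question and answer) are all our own illustrative assumptions.

```python
# Minimal sketch (not the official VideoEval-Pro pipeline): uniform frame
# sampling plus a toy short-answer check. Requires `opencv-python` and `numpy`.
import cv2
import numpy as np


def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Return `num_frames` uniformly spaced frame indices."""
    total_frames = max(total_frames, 1)
    return np.linspace(0, total_frames - 1, num_frames, dtype=int).tolist()


def load_frames(video_path: str, num_frames: int) -> list[np.ndarray]:
    """Decode a video and return uniformly spaced RGB frames."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in sample_frame_indices(total, num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


def judge_short_answer(prediction: str, reference: str) -> bool:
    """Toy judge: normalized substring match. The real benchmark scores
    free-form answers far more carefully than this stand-in."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(reference) in norm(prediction)


# Usage sketch (placeholder model and file names):
# frames = load_frames("example_long_video.mp4", num_frames=64)
# answer = your_video_lmm(frames, "What tool does the chef use to open the jar?")
# correct = judge_short_answer(answer, reference="a butter knife")
```

Note that `num_frames` is the experimental variable behind finding (3) in the abstract: if a benchmark genuinely tests long-video understanding, accuracy should improve as more frames are sampled.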
Community
We present VideoEval-Pro, a more robust and realistic long video understanding benchmark.
Homepage: https://tiger-ai-lab.github.io/VideoEval-Pro
Huggingface Dataset: https://huggingface.co/datasets/TIGER-Lab/VideoEval-Pro
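A minimal sketch of pulling the benchmark from the Hub with the `datasets` library; only the dataset ID is taken from the link above, while the split and column names are not assumed and are simply inspected.

```python
# Sketch: load the benchmark from the Hugging Face Hub and inspect its schema.
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/VideoEval-Pro")   # downloads all available splits
print(ds)                                      # split names and sizes
first_split = next(iter(ds.values()))
print(first_split.column_names)                # question/answer fields, etc.
print(first_split[0])                          # one full example
```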
The following similar papers were recommended by the Semantic Scholar API:
- H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding (2025)
- Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark (2025)
- VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro (2025)
- IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs (2025)
- BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding (2025)
- VEU-Bench: Towards Comprehensive Understanding of Video Editing (2025)
- Vidi: Large Multimodal Models for Video Understanding and Editing (2025)