More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
Abstract
A new metric and benchmark are introduced to evaluate how well multimodal large language models maintain visual grounding during extended reasoning, revealing that larger models balance reasoning and perception better, and that this balance depends more on the type and domain of training data than on its volume.
Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.
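The abstract describes RH-AUC only at a high level, as a measure of how perception accuracy changes with reasoning length. The sketch below illustrates one plausible way such a score could be computed, assuming accuracy is measured at several reasoning lengths and aggregated as a normalized area under that curve; the function name, normalization, and example numbers are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def rh_auc_sketch(reasoning_lengths, perception_accuracies):
    """Illustrative RH-AUC-style score (assumed form, not the paper's exact
    definition): area under the perception-accuracy curve as the reasoning
    chain grows, with lengths normalized to [0, 1] so a score near 1.0 means
    perception accuracy barely degrades, regardless of absolute chain length."""
    x = np.asarray(reasoning_lengths, dtype=float)
    y = np.asarray(perception_accuracies, dtype=float)
    order = np.argsort(x)                                 # sort points by reasoning length
    x, y = x[order], y[order]
    x = (x - x.min()) / max(x.max() - x.min(), 1e-8)      # normalize lengths to [0, 1]
    # Trapezoidal rule over the accuracy-vs-length curve.
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# Hypothetical example: perception accuracy drops as reasoning chains get longer.
lengths = [64, 128, 256, 512, 1024]                       # reasoning-chain lengths in tokens
accuracy = [0.82, 0.80, 0.74, 0.66, 0.58]                 # perception accuracy at each length
print(f"RH-AUC (sketch): {rh_auc_sketch(lengths, accuracy):.3f}")
```

Under this reading, a model that stays visually grounded as its reasoning chains lengthen keeps a flat accuracy curve and a high score, while one that drifts toward language priors shows a falling curve and a lower score.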
Community
The following similar papers were recommended by the Semantic Scholar API:
- DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning (2025)
- Visual Abstract Thinking Empowers Multimodal Reasoning (2025)
- VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge (2025)
- OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning (2025)
- EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models (2025)
- Critique Before Thinking: Mitigating Hallucination through Rationale-Augmented Instruction Tuning (2025)
- LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models (2025)