Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Abstract
Multimodal reasoning models are transitioning from treating vision as static context for text-based reasoning to dynamically integrating visual information into their cognitive processes.
Multimodal reasoning has recently been advanced significantly by textual Chain-of-Thought (CoT), a paradigm in which models conduct reasoning in language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the thinking-with-images paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.
Community
This survey provides a foundational framework for the "Thinking with Images" paradigm, which moves beyond static visual perception to active, multi-step visual reasoning. The survey organizes the field into a three-stage evolution of increasing cognitive autonomy: from leveraging external tools, to programmatically generating visual operations, and finally to performing intrinsic visual imagination. By systematically analyzing the core methodologies, applications, and challenges associated with each stage, this work aims to offer a roadmap for developing the next generation of multimodal AI. A minimal illustrative sketch of the first stage is given below.
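To make the first stage of this roadmap concrete, the sketch below illustrates one plausible form of "external tool exploration": a reasoning loop in which a multimodal model may request visual operations (such as crop-and-zoom) as intermediate steps before committing to an answer, so that new views of the image become part of its working context. This is an assumed, simplified illustration rather than the paper's method; the `call_vlm` interface, the `crop_and_zoom` tool, and the step schema are hypothetical placeholders.

```python
# Illustrative sketch (not from the paper): a minimal stage-one
# "external tool exploration" loop. The model may either answer or
# request a visual tool; each tool result is appended as a new view.
# `call_vlm`, the tool names, and the step schema are hypothetical.
from PIL import Image


def crop_and_zoom(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Return an enlarged view of a region of interest (left, top, right, bottom)."""
    region = image.crop(box)
    return region.resize((region.width * 2, region.height * 2))


TOOLS = {"crop_and_zoom": crop_and_zoom}


def think_with_images(question: str, image: Image.Image, call_vlm, max_steps: int = 4) -> str:
    """Iteratively let the model inspect new views of the image before answering."""
    views = [image]
    for _ in range(max_steps):
        # The (hypothetical) model returns either a final answer
        # or a tool request with arguments, e.g.
        # {"type": "tool", "tool": "crop_and_zoom", "args": (10, 10, 120, 90)}.
        step = call_vlm(question=question, images=views)
        if step["type"] == "answer":
            return step["text"]
        tool = TOOLS[step["tool"]]
        views.append(tool(image, step["args"]))  # the new view enters the context
    # Fall back to answering from all gathered views if the budget is exhausted.
    return call_vlm(question=question, images=views, force_answer=True)["text"]
```

Under this framing, the later stages of the roadmap replace the fixed external tool call with model-generated programs of visual operations, and ultimately with internally imagined intermediate images.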