arXiv:2507.12508

MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

Published on Jul 16 · Submitted by yyuncong on Jul 18
AI-generated summary

MindJourney enhances vision-language models with 3D reasoning by coupling them with a video diffusion-based world model, achieving improved performance on spatial reasoning tasks without fine-tuning.

Abstract

Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over the multi-view evidence gathered during this interactive exploration. Without any fine-tuning, MindJourney achieves an average performance boost of over 8% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Our method also improves upon test-time-inference VLMs trained through reinforcement learning, which demonstrates the potential of using world models for test-time scaling.
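
The explore-then-answer loop described in the abstract is simple enough to sketch. Below is a minimal, runnable illustration of that iterative procedure; every class and method name here (StubVLM, StubWorldModel, propose_camera_action, generate_view) is a hypothetical placeholder for illustration, not the paper's released API.

```python
class StubVLM:
    """Stands in for a VLM playing two roles: proposing camera
    motions and answering from accumulated multi-view evidence."""

    def propose_camera_action(self, views, question):
        # Sketch the next step of a concise camera trajectory;
        # return None once the gathered evidence seems sufficient.
        return "move_forward" if len(views) < 3 else None

    def answer(self, views, question):
        return f"answer derived from {len(views)} views"


class StubWorldModel:
    """Stands in for the video-diffusion world model that renders
    the scene after an egocentric camera motion."""

    def generate_view(self, current_view, action):
        return f"{current_view} -> {action}"


def mindjourney_answer(vlm, world_model, image, question, max_steps=4):
    """Test-time scaling loop: imagine views, then reason over them."""
    views = [image]
    for _ in range(max_steps):
        # 1) The VLM proposes the next egocentric camera motion.
        action = vlm.propose_camera_action(views, question)
        if action is None:
            break
        # 2) The world model synthesizes the corresponding novel view.
        views.append(world_model.generate_view(views[-1], action))
    # 3) The VLM answers using all multi-view evidence; no fine-tuning.
    return vlm.answer(views, question)


print(mindjourney_answer(StubVLM(), StubWorldModel(), "ego_view_0",
                         "Is the sofa left of the door?"))
```

Note that the loop only ever calls the VLM and world model as black boxes, which is what makes the approach plug-and-play across different VLMs.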

Community

Paper author · Paper submitter

Test-time scaling has been very effective on tasks like code generation and math problem solving, but what about tasks in the 3D physical world?
We are excited to introduce MindJourney, a novel test-time scaling framework that uses world models as a source of imagination in 3D space to solve spatial reasoning questions.
Project page: https://umass-embodied-agi.github.io/MindJourney/
Code: https://github.com/UMass-Embodied-AGI/MindJourney
