SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding
Abstract
SpatialScore benchmarks multimodal large language models for 3D spatial understanding, revealing challenges and showcasing the effectiveness of SpatialAgent with specialized tools.
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding remain underexplored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from 11 other existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.
Community
Project Page: https://haoningwu3639.github.io/SpatialScore/
Paper: https://arxiv.org/abs/2505.17012/
Code: https://github.com/haoningwu3639/SpatialScore/
Data: https://huggingface.co/datasets/haoningwu/SpatialScore
We are currently organizing our data and code, and expect to open-source them within 1-2 weeks! Please stay tuned! Feel free to reach out for discussions!
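Once the dataset is public, evaluation should be straightforward with the Hugging Face `datasets` library. The sketch below is only a guess at how scoring might look: the split name and the `image` / `question` / `options` / `answer` fields are assumptions, not the released schema, and the official scoring protocol may differ.

```python
# Hypothetical evaluation loop for SpatialScore (dataset not yet released at the
# time of writing). Split and field names below are assumptions, not the real schema.
from datasets import load_dataset

def evaluate(answer_fn, limit=100):
    """answer_fn(image, question, options) -> predicted answer string."""
    ds = load_dataset("haoningwu/SpatialScore", split="test")  # assumed split name
    n = min(limit, len(ds))
    correct = 0
    for sample in ds.select(range(n)):
        pred = answer_fn(sample["image"], sample["question"], sample.get("options"))
        # Simple exact-match scoring; the official protocol may differ.
        correct += int(pred.strip().lower() == str(sample["answer"]).strip().lower())
    return correct / max(n, 1)
```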
To summarize, we make the following contributions in this paper:
(i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation;
(ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from 11 other existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard;
(iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms (a minimal ReAct-style sketch follows this list);
(iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent.
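To make the ReAct paradigm mentioned in (iii) concrete, here is a minimal, self-contained sketch of a ReAct-style tool-calling loop. It is not the SpatialAgent implementation; the prompt format and the tool names are placeholders for illustration only.

```python
# Minimal ReAct-style loop: the model alternates Thought/Action steps, tool outputs
# are appended as Observations, and the loop stops at "Final Answer:".
# Illustrative sketch only, NOT the SpatialAgent implementation; tool names such as
# "depth_estimator" or "pose_estimator" are placeholders.
def react_loop(llm, tools, question, image, max_steps=6):
    """llm(prompt: str) -> str; tools: dict mapping tool name -> callable(image)."""
    trace = f"Question: {question}\nAnswer with Thought / Action / Final Answer steps.\n"
    for _ in range(max_steps):
        step = llm(trace + "Thought:")
        trace += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:
            tool_name = step.split("Action:")[-1].strip().splitlines()[0]
            result = tools[tool_name](image) if tool_name in tools else f"unknown tool: {tool_name}"
            trace += f"Observation: {result}\n"  # feed the tool output back to the model
    return None  # no final answer within the step budget
```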
Related papers recommended by the Semantic Scholar API:
- ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models (2025)
- SpaceR: Reinforcing MLLMs in Video Spatial Reasoning (2025)
- NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving (2025)
- LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models (2025)
- SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning (2025)
- Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency (2025)
- VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge (2025)