SpaceOm

This model is evaluated in the paper SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence. The code for the SpaCE-10 benchmark is available at: https://github.com/Cuzyoung/SpaCE-10.

Model creator: remyxai
Original model: SpaceOm
GGUF quantization: llama.cpp commit 2baf07727f921d9a4a1b63a2eff941e95d0488ed

Description

Model Overview

SpaceOm improves over SpaceThinker by adding:

the target module o_proj in LoRA fine-tuning
the SpaceOm dataset, for longer reasoning traces
the Robo2VLM-Reasoning dataset, for more robotics-domain and MCVQA examples

The choice to include o_proj among the target modules in LoRA fine-tuning was inspired by the study here, which argues for the importance of this module in reasoning models.
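As a rough illustration of what adding o_proj to the target modules means, the sketch below filters a list of layer names down to those a LoRA adapter would wrap. The module paths, the helper `select_lora_targets`, and the baseline set (q_proj, v_proj) are assumptions made for illustration; they are not the actual SpaceOm training configuration. With a library such as PEFT, a list like this would typically be passed as `target_modules`.

```python
# Hypothetical module paths for one transformer block; real names depend on the model.
MODULE_NAMES = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
]


def select_lora_targets(module_names, target_modules):
    """Return the module paths whose final component is listed in target_modules."""
    targets = set(target_modules)
    return [name for name in module_names if name.rsplit(".", 1)[-1] in targets]


# Assumed baseline target set (not stated in the source), then the same set plus o_proj.
baseline = select_lora_targets(MODULE_NAMES, ["q_proj", "v_proj"])
with_o_proj = select_lora_targets(MODULE_NAMES, ["q_proj", "v_proj", "o_proj"])

# Modules that gain an adapter only once o_proj is targeted.
newly_adapted = sorted(set(with_o_proj) - set(baseline))
print(newly_adapted)
```

Matching on the last path component keeps the selection independent of layer depth, so one short list covers every block in the model.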

The reasoning traces in the SpaceThinker dataset average ~200 "thinking" tokens, so the training data now includes longer reasoning traces to encourage the model to use more tokens when reasoning.
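One way to measure the length of a reasoning trace is sketched below. It assumes, purely for illustration, that the reasoning is wrapped in `<think>...</think>` tags and that whitespace splitting is an acceptable stand-in for a real tokenizer; the example traces are made up and the function names are hypothetical.

```python
import re

# Assumed trace format: reasoning enclosed in <think> tags (not confirmed by this card).
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def thinking_token_count(trace):
    """Count whitespace-separated tokens inside the first <think> span, or 0 if absent."""
    match = THINK_PATTERN.search(trace)
    if match is None:
        return 0
    return len(match.group(1).split())


def average_thinking_tokens(traces):
    """Mean thinking-token count over a list of traces (0.0 for an empty list)."""
    counts = [thinking_token_count(t) for t in traces]
    return sum(counts) / len(counts) if counts else 0.0


# Made-up traces, only to exercise the functions.
example_traces = [
    "<think>The chair is left of the table and closer to the camera.</think><answer>left</answer>",
    "<think>Two boxes are visible.</think><answer>2</answer>",
]
print(average_thinking_tokens(example_traces))
```

A subword tokenizer would report higher counts than whitespace splitting, so figures from this sketch are not directly comparable to the ~200-token average quoted above.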

To improve alignment for robotics applications, we also trained on synthetic reasoning traces derived from the Robo2VLM-1 dataset.

Model Evaluation

SpatialScore - 3B and 4B models

Model	Overall	Count.	Obj.-Loc.	Pos.-Rel.	Dist.	Obj.-Prop.	Cam.&IT.	Tracking	Others
SpaceQwen2.5-VL-3B	42.31	45.01	49.78	57.88	27.36	34.11	26.34	26.44	43.58
SpatialBot-Phi2-3B	41.65	53.23	54.32	55.40	27.12	26.10	24.21	27.57	41.66
Kimi-VL-3B	51.48	49.22	61.99	61.34	38.27	46.74	33.75	56.28	47.23
Kimi-VL-3B-Thinking	52.60	52.66	58.93	63.28	39.38	42.57	32.00	46.97	42.73
Qwen2.5-VL-3B	47.90	46.62	55.55	62.23	32.39	32.97	30.66	36.90	42.19
InternVL2.5-4B	49.82	53.32	62.02	62.02	32.80	27.00	32.49	37.02	48.95
SpaceOm (3B)	49.00	56.00	54.00	65.00	41.00	50.00	36.00	42.00	47.00

See all results for evaluating SpaceOm on the SpatialScore benchmark.

SpaceOm outperforms SpaceQwen in every category.
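The per-category margin over SpaceQwen can be checked directly from the SpatialScore table above. The snippet below is a small sketch that copies the two relevant rows; the variable and function names are illustrative.

```python
CATEGORIES = ["Overall", "Count.", "Obj.-Loc.", "Pos.-Rel.", "Dist.",
              "Obj.-Prop.", "Cam.&IT.", "Tracking", "Others"]

# Rows copied from the SpatialScore table.
SPACEQWEN = [42.31, 45.01, 49.78, 57.88, 27.36, 34.11, 26.34, 26.44, 43.58]
SPACEOM = [49.00, 56.00, 54.00, 65.00, 41.00, 50.00, 36.00, 42.00, 47.00]


def margins(model, baseline):
    """Per-category score difference (model minus baseline), rounded to 2 decimals."""
    return {c: round(m - b, 2) for c, m, b in zip(CATEGORIES, model, baseline)}


deltas = margins(SPACEOM, SPACEQWEN)
wins_everywhere = all(d > 0 for d in deltas.values())

for category, delta in deltas.items():
    print(f"{category:<11} {delta:+.2f}")
print("SpaceOm ahead in every category:", wins_everywhere)
```

The smallest margin is in Others and the largest in Obj.-Prop.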

[Figure: SpaceOm compared with SpaceThinker]

SpaCE-10 Benchmark Comparison

This table compares SpaceOm, evaluated using GPT scoring, against several top models from the SpaCE-10 benchmark leaderboard. Top scores in each category are bolded.

Model	EQ	SQ	SA	OO	OS	EP	FR	SP	Source
SpaceOm	32.47	24.81	47.63	50.00	32.52	9.12	37.04	25.00	GPT Eval
Qwen2.5-VL-7B-Instruct	32.70	31.00	41.30	32.10	27.60	15.40	26.30	27.50	Table
LLaVA-OneVision-7B	37.40	36.20	42.90	44.20	27.10	11.20	45.60	27.20	Table
VILA1.5-7B	30.20	38.60	39.90	44.10	16.50	35.10	30.10	37.60	Table
InternVL2.5-4B	34.30	34.40	43.60	44.60	16.10	30.10	33.70	36.70	Table

Legend:

EQ: Entity Quantification
SQ: Scene Quantification
SA: Size Assessment
OO: Object-Object spatial relations
OS: Object-Scene spatial relations
EP: Entity Presence
FR: Functional Reasoning
SP: Spatial Planning

ℹ️ Note: Scores for SpaceOm are generated via gpt_eval_score on single-choice (*-single) versions of the SpaCE-10 benchmark tasks. Other entries reflect leaderboard accuracy scores from the official SpaCE-10 evaluation table.
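The leader in each SpaCE-10 category can be recomputed from the table rows, as in the sketch below. The names `SCORES` and `top_per_category` are illustrative. Because the SpaceOm row comes from GPT scoring on single-choice tasks while the other rows are leaderboard accuracies, the resulting ranking should be read as indicative rather than a like-for-like comparison.

```python
CATEGORIES = ["EQ", "SQ", "SA", "OO", "OS", "EP", "FR", "SP"]

# Rows copied from the SpaCE-10 comparison table.
SCORES = {
    "SpaceOm": [32.47, 24.81, 47.63, 50.00, 32.52, 9.12, 37.04, 25.00],
    "Qwen2.5-VL-7B-Instruct": [32.70, 31.00, 41.30, 32.10, 27.60, 15.40, 26.30, 27.50],
    "LLaVA-OneVision-7B": [37.40, 36.20, 42.90, 44.20, 27.10, 11.20, 45.60, 27.20],
    "VILA1.5-7B": [30.20, 38.60, 39.90, 44.10, 16.50, 35.10, 30.10, 37.60],
    "InternVL2.5-4B": [34.30, 34.40, 43.60, 44.60, 16.10, 30.10, 33.70, 36.70],
}


def top_per_category(scores, categories):
    """Map each category to the model with the highest score in that column."""
    leaders = {}
    for i, category in enumerate(categories):
        leaders[category] = max(scores, key=lambda model: scores[model][i])
    return leaders


leaders = top_per_category(SCORES, CATEGORIES)
for category, model in leaders.items():
    print(f"{category}: {model}")
```

On these numbers SpaceOm leads in SA, OO, and OS.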

Limitations

Performance may degrade in cluttered environments or with certain camera perspectives.
This model was fine-tuned using synthetic reasoning over an internet image dataset.
Multimodal biases inherent to the base model (Qwen2.5-VL) may persist.
Not intended for use in safety-critical or legal decision-making.

Users are encouraged to evaluate outputs critically and consider fine-tuning for domain-specific safety and performance. Distances estimated using autoregressive transformers may help in higher-order reasoning for planning and behavior but may not be suitable replacements for measurements taken with high-precision sensors, calibrated stereo vision systems, or specialist monocular depth estimation models capable of more accurate, pixel-wise predictions and real-time performance.

Citation

@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}

@misc{qwen2.5-VL,
  title = {Qwen2.5-VL},
  url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
  author = {Qwen Team},
  month = {January},
  year = {2025}
}

@misc{vl-thinking2025,
  title={SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models},
  author={Hardy Chen and Haoqin Tu and Fali Wang and Hui Liu and Xianfeng Tang and Xinya Du and Yuyin Zhou and Cihang Xie},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/UCSC-VLAA/VLAA-Thinking}},
}


@article{wu2025spatialscore,
    author    = {Wu, Haoning and Huang, Xiao and Chen, Yaohui and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
    title     = {SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding},
    journal   = {arXiv preprint arXiv:2505.17012},
    year      = {2025},
}

@article{gong2025space10,
  title     = {SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence},
  author    = {Ziyang Gong and Wenhao Li and Oliver Ma and Songyuan Li and Jiayi Ji and Xue Yang and Gen Luo and Junchi Yan and Rongrong Ji},
  journal   = {arXiv preprint arXiv:2506.07966},
  year      = {2025},
  url       = {https://arxiv.org/abs/2506.07966}
}

GGUF

Model size: 3.09B params
Architecture: qwen2vl
Hardware compatibility (quantization bit widths): 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, 16-bit

Image-Text-to-Text

Model tree for mgonzs13/SpaceOm-GGUF

Base model: UCSC-VLAA/VLAA-Thinker-Qwen2.5VL-3B
Finetuned: remyxai/SpaceOm
Quantized: this model

Evaluation results

Overall Success Rate on 3DSRBench (self-reported): 0.542, 0.599, 0.388, 0.583, 0.446, 0.488, 0.611, 0.704, 0.350, 0.256
