arxiv:2505.10557

MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

Published on May 15
Submitted by scikkk on May 15
Abstract

Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenes and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all the information needed to generate the corresponding figure, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in FigCodifier, an image-to-code model, and ImgCode-8.6M, the largest image-code dataset to date. Furthermore, we use FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet on the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%, respectively. The dataset and models will be released at https://github.com/mathllm/MathCoder.
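
To illustrate the code-as-supervision idea, the sketch below renders a simple geometry figure from plotting code: the code fully determines the image, so the (image, code) pair gives an exact cross-modal correspondence of the kind ImgCode-8.6M aligns. The choice of matplotlib and the specific figure are assumptions for illustration, not the paper's actual data format.

```python
# Minimal sketch of an (image, code) training pair: the plotting code below
# fully specifies the figure it renders. The figure content and the use of
# matplotlib are illustrative assumptions, not the paper's data format.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 4))

# A right triangle with labeled vertices, a typical geometry-problem figure.
A, B, C = (0, 0), (4, 0), (4, 3)
xs, ys = zip(A, B, C, A)
ax.plot(xs, ys, color="black")
for name, (x, y) in zip("ABC", (A, B, C)):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(5, 5))

ax.set_aspect("equal")
ax.axis("off")
fig.savefig("triangle.png", dpi=150)  # the image half of the pair
plt.close(fig)
```

An image-to-code model like FigCodifier is trained on such pairs to invert this rendering step, mapping a figure back to code that reproduces it.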

Community

Paper submitter
  • [2025.05.16] 🤗 MathCoder-VL-2B, MathCoder-VL-8B and FigCodifier-8B are available now (a hedged loading sketch follows below)! 🔥🔥🔥
  • [2025.05.16] Our MathCoder-VL has been accepted to ACL 2025 Findings. 🔥🔥🔥
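
A minimal sketch of loading a released checkpoint with Hugging Face transformers. The repository ID and the use of trust_remote_code are assumptions; consult the project README at https://github.com/mathllm/MathCoder for the authoritative model names and inference interface.

```python
# Hypothetical loading sketch; repository ID and inference interface are
# assumptions, see https://github.com/mathllm/MathCoder for actual usage.
from transformers import AutoModel, AutoTokenizer

model_id = "MathLLMs/MathCoder-VL-8B"  # assumed repo name, verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()
```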


Models citing this paper 3

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 1