KaiChen1998
posted an update 3 days ago
🤔 Do advanced reasoning LLMs keep getting released before your current MLLM alignment is done? Try RACRO! Train once, then flexibly switch to new LLM reasoners at inference time!

📢 RACRO is a novel methodology for building multi-modal large reasoning models. By decoupling multi-modal reasoning into 1) query-conditioned captioning and 2) text-only reasoning, we achieve SoTA results on multi-modal reasoning benchmarks while allowing the reasoner to be swapped for any advanced reasoning model at inference time. We further propose CRO, a novel GRPO variant that reinforces query-conditioned captioning using only verifiable data from multi-modal mathematical questions.
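
To make the decoupling concrete, here is a minimal sketch of the two-stage flow described above. It is not RACRO's actual code: `DecoupledPipeline`, `with_reasoner`, the prompt format, and the stub models are all hypothetical names chosen for illustration. The point is only that the captioner stays fixed while the text-only reasoner can be replaced.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces for illustration, not RACRO's actual API.
Captioner = Callable[[bytes, str], str]  # (image bytes, question) -> caption text
Reasoner = Callable[[str], str]          # text prompt -> answer text


@dataclass(frozen=True)
class DecoupledPipeline:
    captioner: Captioner
    reasoner: Reasoner

    def answer(self, image: bytes, question: str) -> str:
        # Stage 1: query-conditioned captioning (the captioner sees the question).
        caption = self.captioner(image, question)
        # Stage 2: text-only reasoning over the caption and the question.
        # The prompt layout here is an assumption.
        prompt = f"Image description: {caption}\nQuestion: {question}"
        return self.reasoner(prompt)

    def with_reasoner(self, reasoner: Reasoner) -> "DecoupledPipeline":
        # Swap the reasoner without touching (or retraining) the captioner.
        return DecoupledPipeline(self.captioner, reasoner)


# Stub models so the sketch runs without any real checkpoints.
def stub_captioner(image: bytes, question: str) -> str:
    return "A right triangle with legs 3 and 4."


def stub_reasoner_a(prompt: str) -> str:
    return "5"


def stub_reasoner_b(prompt: str) -> str:
    return "The hypotenuse is 5."


pipeline = DecoupledPipeline(stub_captioner, stub_reasoner_a)
swapped = pipeline.with_reasoner(stub_reasoner_b)
```

In a real setup the two stubs would wrap a vision-language captioner and a text-only reasoning LLM; swapping in a newer reasoner is then a single call to `with_reasoner`.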

✨ Highlights
✅ State-of-the-art multi-modal reasoning: we achieve SoTA performance on multi-modal mathematical benchmarks, exceeding advanced commercial models like Claude-3.7-Sonnet and Gemini-2.0-Flash.
✅ Inference-time scalability: thanks to the perceptual decoupling, we can flexibly swap LLM reasoners during inference, which gives multi-modal reasoning a unique form of inference-time scalability.
✅ Highly efficient: with only a single round of Caption Reward Optimization (CRO) training on ~39K samples, RACRO avoids burdensome multi-modal alignment (e.g., 4.1T tokens for Qwen2.5-VL).
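
For readers unfamiliar with GRPO-style training, the sketch below shows the group-relative advantage computation that GRPO is known for, applied to caption rewards. It rests on assumptions, since the post gives no details: each sampled caption gets a binary reward based on whether a reasoner's answer from that caption matches the verifiable ground truth, and advantages use the population standard deviation. The function names are hypothetical; see the paper for the actual CRO objective.

```python
from statistics import mean, pstdev


def caption_rewards(answers: list[str], ground_truth: str) -> list[float]:
    # Assumed reward: 1.0 if the answer derived from a caption matches the
    # verifiable ground truth (exact match after stripping), else 0.0.
    target = ground_truth.strip()
    return [1.0 if a.strip() == target else 0.0 for a in answers]


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # GRPO-style normalization within one group of samples for the same query:
    # subtract the group mean and divide by the group standard deviation.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group: four captions sampled for the same image and question, each
# passed to a reasoner whose final answers are listed here (made-up values).
answers = ["4", "5", "4 ", "3"]
rewards = caption_rewards(answers, "4")
advantages = group_relative_advantages(rewards)
```

Captions that lead to a correct answer get positive advantages and the rest get negative ones; if every caption in a group earns the same reward, all advantages are zero and that group contributes no learning signal.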

🔥 Everyone is welcome to try it out and star the repo!
- Paper: Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning (2506.04559)
- GitHub: https://github.com/gyhdog99/RACRO2
- Demo: Emova-ollm/RACRO-demo