Post
176
๐ค Advanced reasoning LLMs keep releasing before your current MLLM alignment is done? Try our RACRO! Train once, flexible change to novel LLM reasoners during inference time!
๐ข RACRO is a novel methodology to build multi-modal large reasoning models. By decoupling multi-modal reasoning into 1) query-conditioned captioning and 2) text-only reasoning, we achieve SoTA results on multi-modal reasoning benchmarks, while supporting flexible changes to any advanced reasoning models during inference. We further propose CRO, a novel GRPO-variant to reinforce query-conditioned captioning with only verifiable data for multi-modal mathematical questions.
โจ Highlights
โ State-of-the-art multi-modal reasoning: we achieve SoTA performance on multi-modal mathematical benchmarks, exceeding advanced commercial models like Claude-3.7-Sonnet and Gemini-2.0-Flash.
โ Inference-time scalability: thanks to the perceptual decoupling, we can flexibly change LLM reasoners during inference, providing a unique inference-time scalability for multi-modal reasoning.
โ Highly efficient: With only a single round of Caption Reward Optimization (CRO) training on ~39K samples, RACRO gets rid of burdensome multi-modal alignment (e.g., 4.1T tokens for Qwen2.5-VL).
๐ฅ You are all welcome to try and star!
- Paper: Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning (2506.04559)
- Github: https://github.com/gyhdog99/RACRO2
- Demo: Emova-ollm/RACRO-demo
๐ข RACRO is a novel methodology to build multi-modal large reasoning models. By decoupling multi-modal reasoning into 1) query-conditioned captioning and 2) text-only reasoning, we achieve SoTA results on multi-modal reasoning benchmarks, while supporting flexible changes to any advanced reasoning models during inference. We further propose CRO, a novel GRPO-variant to reinforce query-conditioned captioning with only verifiable data for multi-modal mathematical questions.
โจ Highlights
โ State-of-the-art multi-modal reasoning: we achieve SoTA performance on multi-modal mathematical benchmarks, exceeding advanced commercial models like Claude-3.7-Sonnet and Gemini-2.0-Flash.
โ Inference-time scalability: thanks to the perceptual decoupling, we can flexibly change LLM reasoners during inference, providing a unique inference-time scalability for multi-modal reasoning.
โ Highly efficient: With only a single round of Caption Reward Optimization (CRO) training on ~39K samples, RACRO gets rid of burdensome multi-modal alignment (e.g., 4.1T tokens for Qwen2.5-VL).
๐ฅ You are all welcome to try and star!
- Paper: Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning (2506.04559)
- Github: https://github.com/gyhdog99/RACRO2
- Demo: Emova-ollm/RACRO-demo