Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
Abstract
Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning of limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen RMs' reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct-response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) we first use a small amount of image-generation preference data to distill the reasoning process of GPT-4o, which serves as the model's cold start for learning the format and structure of CoT reasoning; (2) we then leverage the model's prior knowledge and generalization capabilities, preparing large-scale unified multimodal preference data to elicit its reasoning process across diverse vision tasks, and retain correctly reasoned outputs for rejection sampling to refine the model; (3) finally, incorrectly predicted samples are used for Group Relative Policy Optimization (GRPO)-based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize toward correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.
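To make the GRPO stage in step (3) concrete, here is a minimal PyTorch sketch of the group-relative advantage and the clipped policy objective it optimizes. Everything below is illustrative rather than the paper's implementation: the function names (`grpo_advantages`, `grpo_loss`), tensor shapes, hyperparameter values, and the binary reward scheme (1 if the sampled CoT's final preference judgment matches the ground-truth label, else 0) are assumptions, and token-level details such as masking and per-token advantage broadcasting are omitted.

```python
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each sampled response's reward is
    normalized against the mean/std of its own group (one group per prompt).

    rewards: (num_prompts, group_size) -- one scalar reward per sampled CoT.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


def grpo_loss(
    logp_new: torch.Tensor,   # log-probs of sampled responses under current policy
    logp_old: torch.Tensor,   # log-probs under the policy that generated them
    advantages: torch.Tensor, # output of grpo_advantages, same shape
    clip_eps: float = 0.2,    # PPO-style clipping range (illustrative value)
    kl_coef: float = 0.04,    # KL penalty weight (illustrative value)
    logp_ref: torch.Tensor | None = None,  # log-probs under a frozen reference model
) -> torch.Tensor:
    """Clipped policy-gradient objective with an optional KL penalty
    toward a frozen reference model, as in PPO/GRPO-style updates."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()
    if logp_ref is not None:
        # Unbiased KL estimator commonly used with GRPO: r - log r - 1, r = p_ref / p_new
        log_ratio = logp_ref - logp_new
        kl = torch.exp(log_ratio) - log_ratio - 1.0
        loss = loss + kl_coef * kl.mean()
    return loss


if __name__ == "__main__":
    # 2 prompts, 4 sampled CoT rollouts each; binary verifiable rewards (assumed scheme).
    rewards = torch.tensor([[1.0, 0.0, 1.0, 1.0],
                            [0.0, 0.0, 1.0, 0.0]])
    print(grpo_advantages(rewards))  # each rollout scored relative to its group
```

Because the preference label makes the reward verifiable, no learned critic is needed: responses in a group compete only against each other, which is what lets the model explore diverse reasoning paths while being pulled toward ones that reach the correct judgment.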
Community
We release UnifiedReward-Think -- the first unified multimodal CoT reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks.
📌 Project page: https://codegoat24.github.io/UnifiedReward/think
📄 Paper: https://arxiv.org/pdf/2505.03318
💻 GitHub: https://github.com/CodeGoat24/UnifiedReward
🤗 Model: https://huggingface.co/collections/CodeGoat24/unifiedreward-models-67c3008148c3a380d15ac63a
📊 Dataset: https://huggingface.co/collections/CodeGoat24/unifiedreward-training-data-67c300d4fd5eff00fa7f1ede
How can I run a successful three-hour training workshop for beginner, no-code students on "An Introductory Overview of Artificial Intelligence and Its Applications in University Research and Theses"?