arxiv:2505.18129

One RL to See Them All: Visual Triple Unified Reinforcement Learning

Published on May 23, 2025 · Submitted by Ryan1122 on May 26, 2025
Authors: Yan Ma et al.

AI-generated summary

A unified reinforcement learning system, V-Triune, combines visual reasoning and perception tasks in vision-language models through a single training pipeline, achieving significant improvements across various tasks.

Abstract

Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Orsta subsequently achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.
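The Dynamic IoU reward is described only at a high level above; here is a minimal Python sketch of one plausible reading, in which the IoU threshold a predicted box must clear tightens as training progresses, so early predictions earn credit for coarse localization while later ones must be precise. The threshold schedule, step boundaries, and function names are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of a Dynamic IoU reward for box-prediction tasks.
# Assumption: the required IoU threshold rises on a fixed step schedule;
# the concrete values below are placeholders, not the paper's settings.

def box_iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dynamic_iou_reward(pred_box, gt_box, step,
                       schedule=((0, 0.5), (1000, 0.75), (2000, 0.95))):
    """Reward = IoU if it clears the threshold active at this step, else 0."""
    threshold = max(t for s, t in schedule if step >= s)
    iou = box_iou(pred_box, gt_box)
    return iou if iou >= threshold else 0.0
```

Under this schedule, a box with IoU 0.6 would be rewarded at step 500 but earn nothing at step 1500, which is the progressive behavior the abstract describes.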

Community

Paper author and submitter:

V-Triune is a visual unified reinforcement learning (RL) system that enables vision-language models (VLMs) to jointly learn reasoning and perception tasks. It integrates three key components—sample-level data formatting, verifier-level reward computation, and source-level metric monitoring—and introduces a novel Dynamic IoU reward for adaptive perception feedback. Built on open-source 7B and 32B models, the resulting system, Orsta, achieves significant performance gains (up to +14.1) across diverse tasks in MEGA-Bench Core, demonstrating the scalability and effectiveness of RL beyond reasoning.
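As a rough illustration of the verifier-level reward computation mentioned above, the Python sketch below routes each formatted sample to a task-specific verifier through a small registry. The task tags, verifier rules, and function names are hypothetical stand-ins, not V-Triune's actual code.

```python
# Hedged sketch: per-task reward verifiers behind a single dispatch point.
from typing import Callable, Dict

VERIFIERS: Dict[str, Callable[[str, str], float]] = {}

def register(task: str):
    """Decorator that files a verifier under a task tag."""
    def wrap(fn):
        VERIFIERS[task] = fn
        return fn
    return wrap

@register("math")
def verify_math(prediction: str, answer: str) -> float:
    # Rule-based check: exact match on the final answer string.
    return 1.0 if prediction.strip() == answer.strip() else 0.0

@register("counting")
def verify_counting(prediction: str, answer: str) -> float:
    # Numeric comparison, tolerant of surrounding whitespace.
    try:
        return 1.0 if int(prediction.strip()) == int(answer.strip()) else 0.0
    except ValueError:
        return 0.0

def compute_reward(sample: dict, prediction: str) -> float:
    """Look up the verifier named by the sample's task tag and score the output."""
    return VERIFIERS[sample["task"]](prediction, sample["answer"])

# Example: a math sample is scored by the math verifier.
print(compute_reward({"task": "math", "answer": "42"}, "42"))  # 1.0
```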

The models and code are available at MiniMax/One-RL-to-See-Them-All.


Models citing this paper 3

Datasets citing this paper 1

Spaces citing this paper 0

Collections including this paper 17