One RL to See Them All
- GitHub Repo: MiniMax-AI/One-RL-to-See-Them-All
- Paper (arXiv): V-Triune: One RL to See Them All (arXiv:2505.18129)
- Dataset: Orsta-Data-47k on Hugging Face
Model Overview
Orsta-32B-0321 is a vision-language model (VLM) designed for strong performance across a wide spectrum of visual reasoning and visual perception tasks. It is the result of post-training with V-Triune, our unified reinforcement learning (RL) system.
The V-Triune system enables VLMs to be jointly optimized on diverse multimodal tasks within a single, cohesive training pipeline. Orsta-32B-0321 has been specifically trained using V-Triune on a carefully curated set of eight challenging visual tasks, fostering robust generalization and enhanced capabilities.
Training with V-Triune
Orsta-32B-0321's advanced abilities stem from its training with the V-Triune system. Key aspects of its training include:
Unified RL Framework (V-Triune): V-Triune is a Visual Triple-Unified Reinforcement Learning system featuring three core complementary components:
- Sample-Level Data Formatting (to unify diverse task inputs)
- Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers)
- Source-Level Metric Monitoring (to diagnose problems at the data-source level)
It also incorporates an innovative Dynamic IoU reward mechanism, crucial for optimizing visual perception tasks (a brief sketch follows below). You can find more details in our paper: V-Triune.
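To make the verifier-level reward idea concrete, here is a minimal sketch of a perception reward whose IoU threshold tightens over training. The threshold schedule and function names below are illustrative assumptions, not the actual V-Triune implementation; see the paper for the exact reward design.

```python
# Hypothetical sketch of a verifier-level reward with a Dynamic IoU threshold.
# The schedule (0.5 -> 0.75 -> 0.95) and function names are assumptions for
# illustration only, not the reward used in V-Triune.

def iou(box_a, box_b):
    """Intersection-over-Union for two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred_box, gt_box, train_step, total_steps):
    """Binary reward whose IoU threshold grows stricter as training progresses."""
    progress = train_step / max(total_steps, 1)
    # Loose threshold early (easier positive signal), strict threshold late.
    threshold = 0.5 if progress < 0.3 else (0.75 if progress < 0.7 else 0.95)
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0
```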
Diverse Joint Task Optimization: Orsta-32B-0321 was jointly optimized on the following eight visual tasks:
- Visual Reasoning Tasks: Mathematics, Science Question Answering, Chart Understanding, and Puzzle Solving.
- Visual Perception Tasks: Object Detection, Visual Grounding, Optical Character Recognition (OCR), and Object Counting.
This comprehensive training allows Orsta-32B-0321 to develop a deeper understanding of visual content and its relation to textual prompts, excelling in tasks that require intricate reasoning and precise perception.
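As an illustration of how such heterogeneous tasks can share one pipeline via sample-level data formatting, the sketch below shows what unified per-sample records might look like. The field names and verifier labels are assumptions for illustration, not the actual Orsta-Data-47k schema.

```python
# Hypothetical per-sample records: each sample carries its own task type,
# verifier, and reward weighting, so reasoning and perception data can be
# mixed in a single RL batch. Field names are assumptions, not the real schema.

math_sample = {
    "task": "math",
    "image": "figure.png",
    "prompt": "Solve the problem shown in the image.",
    "answer": "42",
    "verifier": "math_verify",        # rule-based answer matching
    "reward_weights": {"accuracy": 1.0, "format": 0.1},
}

detection_sample = {
    "task": "detection",
    "image": "street_scene.png",
    "prompt": "Detect every car and return bounding boxes.",
    "answer": [[34, 50, 210, 180], [300, 90, 420, 200]],  # ground-truth boxes
    "verifier": "detection_verify",   # IoU-based matching (see sketch above)
    "reward_weights": {"iou": 1.0, "format": 0.1},
}
```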
Performance
| Model | Knowledge | Mathematics | Perception | Coding | Info. Extraction | Planning | Science | Metrics | MEGA-Bench Core |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-32B-0321 | 8.48 | 12.62 | 11.99 | 13.59 | 15.44 | 8.61 | 16.78 | 14.91 | 11.87 |
| MM-Eureka-32B | 12.20 | 20.19 | 21.88 | 15.86 | 21.23 | 15.47 | 19.95 | 22.77 | 18.57 |
| VL-Rethinker-32B | 12.16 | 28.09 | 22.99 | 11.89 | 21.50 | 15.09 | 28.10 | 15.73 | 19.41 |
| Orsta-32B-0321 (Ours) | 21.33 | 28.55 | 32.23 | 19.44 | 26.38 | 17.78 | 33.20 | 24.18 | 25.94 |
| Δ (Ours - Backbone) | +12.9 | +15.9 | +20.2 | +5.9 | +10.9 | +9.2 | +16.4 | +9.3 | +14.1 |
How to Use
Orsta-32B-0321 is developed by post-training the Qwen2.5-VL-32B-Instruct (0321 checkpoint) model with our V-Triune reinforcement learning system. The 0321 checkpoint is a publicly available baseline with reliable core reasoning abilities but recognized limitations in perception and output formatting (addressed in subsequent Qwen releases). Applying V-Triune to this baseline shows how post-training can unlock the model's potential and significantly raise its performance by refining and amplifying existing strengths.
Consequently, the core usage of Orsta-32B-0321, particularly regarding input formatting and model interaction, largely follows the established patterns of the Qwen2.5-VL series. Users familiar with Qwen2.5-VL models should find the interface intuitive.
For comprehensive details on the general capabilities of Qwen2.5-VL models, including multi-turn dialogue format and image input specifics, we recommend referring to the official Qwen2.5-VL series documentation (please ensure to consult information relevant to the 32B Instruct version).
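Because usage follows the Qwen2.5-VL series, a minimal inference sketch with Hugging Face transformers and qwen-vl-utils is shown below. The repository id, image URL, and generation settings are assumptions; substitute the actual Orsta-32B-0321 checkpoint and adjust to your hardware.

```python
# Minimal inference sketch following the standard Qwen2.5-VL usage pattern.
# The repository id below is an assumption; replace it with the actual
# Orsta-32B-0321 checkpoint on Hugging Face.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "One-RL-to-See-Them-All/Orsta-32B-0321"  # assumed repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/chart.png"},
            {"type": "text", "text": "What is the highest value in this chart?"},
        ],
    }
]

# Build the chat prompt and extract image/video inputs, as in Qwen2.5-VL docs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```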
Citation
If you use Orsta-32B-0321 or the V-Triune system in your research, please cite our work:
@article{ma2025one,
title={One RL to See Them All: Visual Triple Unified Reinforcement Learning},
author={Ma, Yan and Du, Linge and Shen, Xuyang and Chen, Shaoxiang and Li, Pengfei and Ren, Qibing and Ma, Lizhuang and Dai, Yuchao and Liu, Pengfei and Yan, Junjie},
journal={arXiv preprint arXiv:2505.18129},
year={2025}
}