# One RL to See Them All
- GitHub Repo: MiniMax-AI/One-RL-to-See-Them-All
- Paper (arXiv): V-Triune: One RL to See Them All (arXiv:2505.18129)
- Dataset: Orsta-Data-47k on Hugging Face
## Model Overview
Orsta-32B-0326 is a vision-language model (VLM) designed to achieve strong performance across a wide spectrum of both visual reasoning and visual perception tasks. This model is the result of post-training with V-Triune, our unified reinforcement learning (RL) system.
The V-Triune system enables VLMs to be jointly optimized on diverse multimodal tasks within a single, cohesive training pipeline. Orsta-32B-0326 was trained with V-Triune on a carefully curated set of eight challenging visual tasks, fostering robust generalization and enhanced capabilities.
## Training with V-Triune
Orsta-32B-0326's advanced abilities stem from its training with the V-Triune system. Key aspects of its training include:
Unified RL Framework (V-Triune): V-Triune is a Visual Triple-Unified Reinforcement Learning system featuring three complementary components:
- Sample-Level Data Formatting (to unify diverse task inputs)
- Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers)
- Source-Level Metric Monitoring (to diagnose problems at the data-source level)

It also incorporates a Dynamic IoU reward mechanism, crucial for optimizing visual perception tasks; a toy sketch of these ideas follows below. You can find more details in our paper, V-Triune (arXiv:2505.18129).
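To make these components concrete, here is a minimal illustrative sketch, not the actual V-Triune implementation: each sample carries its own verifier spec (sample-level formatting), rewards come from task-specific verifiers (verifier-level computation), and the detection verifier's IoU threshold tightens as training progresses, in the spirit of the Dynamic IoU reward. All function names and the threshold schedule here are hypothetical.

```python
# Illustrative sketch only; all names and the threshold schedule are
# hypothetical. See the V-Triune paper (arXiv:2505.18129) for the real design.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def dynamic_iou_threshold(progress):
    """Loose early in training, strict late (hypothetical schedule)."""
    if progress < 0.3:
        return 0.5
    if progress < 0.7:
        return 0.75
    return 0.95

def verify(sample, prediction, progress):
    """Dispatch each sample to its own verifier (sample-level formatting)."""
    if sample["verifier"] == "math":
        # Reasoning tasks: binary reward for an exact final-answer match.
        return 1.0 if prediction.strip() == sample["answer"].strip() else 0.0
    if sample["verifier"] == "detection":
        # Perception tasks: reward predicted boxes that clear the current bar.
        thr = dynamic_iou_threshold(progress)
        hits = sum(iou(p, g) >= thr
                   for p, g in zip(prediction, sample["answer"]))
        return hits / max(len(sample["answer"]), 1)
    raise ValueError(f"unknown verifier: {sample['verifier']}")

# Example: a math sample and a detection sample can share one batch.
math_sample = {"verifier": "math", "answer": "42"}
det_sample = {"verifier": "detection", "answer": [(10, 10, 50, 50)]}
print(verify(math_sample, "42", progress=0.5))               # 1.0
print(verify(det_sample, [(12, 11, 49, 52)], progress=0.5))  # 1.0 at thr=0.75
```

Because the verifier choice travels with each sample, reasoning and perception data can be mixed freely in a single batch, which is what lets the eight tasks below share one training pipeline.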
Diverse Joint Task Optimization: Orsta-32B-0326 was jointly optimized on the following eight visual tasks:
- Visual Reasoning Tasks: Mathematics, Science Question Answering, Chart Understanding, and Puzzle Solving.
- Visual Perception Tasks: Object Detection, Visual Grounding, Optical Character Recognition (OCR), and Object Counting.
This comprehensive training allows Orsta-32B-0326 to develop a deeper understanding of visual content and its relation to textual prompts, excelling in tasks that require intricate reasoning and precise perception.
## Performance
Results on MEGA-Bench (per-category scores and the overall Core score). The Δ row compares Orsta-32B-0326 against its Qwen2.5-VL-32B-0326 backbone.

| Model | Knowledge | Mathematics | Perception | Coding | Info. Ex. | Planning | Science | Metrics | MEGA-Bench Core |
|---|---|---|---|---|---|---|---|---|---|
| Gemma3-27B | 49.43 | 42.20 | 45.46 | 40.18 | 49.30 | 24.96 | 47.08 | 58.99 | 41.82 |
| Qwen2.5-VL-32B-0326 | 46.09 | 32.04 | 47.55 | 38.36 | 61.65 | 28.43 | 37.55 | 50.38 | 43.67 |
| InternVL-3-38B | 46.32 | 40.29 | 55.05 | 45.29 | 56.63 | 22.88 | 52.04 | 58.04 | 46.69 |
| Skywork-R1V-38B | 25.59 | 28.45 | 22.95 | 19.88 | 19.53 | 9.74 | 22.64 | 37.55 | 21.54 |
| Skywork-R1V2-38B | 17.08 | 12.38 | 15.65 | 7.14 | 9.90 | 17.60 | 14.29 | 0.00 | 15.39 |
| Orsta-32B-0326 (Ours) | 46.78 | 37.43 | 50.86 | 38.92 | 63.14 | 28.05 | 42.68 | 53.01 | 45.78 |
| Δ (Ours - Backbone) | +0.7 | +5.4 | +3.3 | +0.6 | +1.5 | -0.4 | +5.1 | +2.6 | +2.1 |
## How to Use
Orsta-32B-0326 is developed by post-training the latest Qwen2.5-VL-32B-Instruct model using our V-Triune reinforcement learning system. Consequently, its core usage, particularly regarding input formatting and model interaction, largely follows the established patterns of the Qwen2.5-VL series.
For comprehensive details on the base model's capabilities, multi-turn dialogue format, image input encoding specifics, and other functionalities, we recommend referring to the official Qwen2.5-VL documentation.
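As a starting point, the snippet below follows the standard Qwen2.5-VL inference pattern from the `transformers` library (with `qwen-vl-utils` for vision preprocessing). The repo id `MiniMax-AI/Orsta-32B-0326` and the image URL are assumptions for illustration; check the project's Hugging Face page for the exact checkpoint name.

```python
# Minimal inference sketch following the Qwen2.5-VL usage pattern.
# The repo id and image URL below are assumed for illustration.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "MiniMax-AI/Orsta-32B-0326"  # assumed repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpg"},
            {"type": "text", "text": "Count the objects in this image."},
        ],
    }
]

# Apply the chat template, extract vision inputs, and generate.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```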
## Citation
If you use Orsta-32B-0326 or the V-Triune system in your research, please cite our work:
@article{ma2025one,
title={One RL to See Them All: Visual Triple Unified Reinforcement Learning},
author={Ma, Yan and Du, Linge and Shen, Xuyang and Chen, Shaoxiang and Li, Pengfei and Ren, Qibing and Ma, Lizhuang and Dai, Yuchao and Liu, Pengfei and Yan, Junjie},
journal={arXiv preprint arXiv:2505.18129},
year={2025}
}