One RL to See Them All
- GitHub Repo: MiniMax-AI/One-RL-to-See-Them-All
- Paper (arXiv): V-Triune: One RL to See Them All (arXiv:2505.18129)
- Dataset: Orsta-Data-47k on Hugging Face
Model Overview
Orsta-7B is a vision-language model (VLM) built to perform strongly across a wide spectrum of visual reasoning and visual perception tasks. It is the result of post-training with V-Triune, our unified reinforcement learning (RL) system.
The V-Triune system enables VLMs to be jointly optimized on diverse multimodal tasks within a single, cohesive training pipeline. Orsta-7B has been specifically trained using V-Triune on a carefully curated set of eight challenging visual tasks, fostering robust generalization and enhanced capabilities.
Training with V-Triune
Orsta-7B's advanced abilities stem from its training with the V-Triune system. Key aspects of its training include:
Unified RL Framework (V-Triune): V-Triune is a Visual Triple-Unified Reinforcement Learning system built from three complementary components:
- Sample-Level Data Formatting (to unify diverse task inputs)
- Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers)
- Source-Level Metric Monitoring (to diagnose problems at the data-source level)
- V-Triune also incorporates a Dynamic IoU reward mechanism, which is crucial for optimizing visual perception tasks (see the illustrative sketch at the end of this section). More details are in the V-Triune paper (arXiv:2505.18129).
Diverse Joint Task Optimization: Orsta-7B was jointly optimized on the following eight visual tasks:
- Visual Reasoning Tasks: Mathematics, Science Question Answering, Chart Understanding, and Puzzle Solving.
- Visual Perception Tasks: Object Detection, Visual Grounding, Optical Character Recognition (OCR), and Object Counting.
This comprehensive training allows Orsta-7B to develop a deeper understanding of visual content and its relation to textual prompts, excelling in tasks that require intricate reasoning and precise perception.
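The Dynamic IoU reward mentioned above can be pictured as a rule-based verifier whose IoU threshold tightens as training progresses, so early predictions receive denser feedback while later ones are held to a stricter standard. The sketch below is illustrative only and is not the V-Triune implementation: the linear threshold schedule, the binary reward, and the names `iou` and `dynamic_iou_reward` are assumptions made for this example.

```python
# Illustrative sketch of a dynamic IoU reward (not the V-Triune implementation):
# a detection/grounding prediction earns reward only if its IoU with the ground
# truth exceeds a threshold that tightens over the course of training.

def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred_box, gt_box, step, total_steps,
                       start_thresh=0.5, end_thresh=0.95):
    """Binary reward with an IoU threshold that ramps up linearly (assumed schedule)."""
    progress = min(step / max(total_steps, 1), 1.0)
    thresh = start_thresh + (end_thresh - start_thresh) * progress
    return 1.0 if iou(pred_box, gt_box) >= thresh else 0.0

# Example: the same moderately accurate prediction (IoU ~ 0.86) passes the
# lenient early threshold but fails the strict late one.
pred, gt = (10, 10, 100, 100), (12, 15, 105, 98)
print(dynamic_iou_reward(pred, gt, step=100, total_steps=10_000))    # -> 1.0
print(dynamic_iou_reward(pred, gt, step=9_500, total_steps=10_000))  # -> 0.0
```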
Performance
Model | Knowledge | Mathematics | Perception | Coding | Info. Ex. | Planning | Science | Metrics | MEGA-Bench Core |
---|---|---|---|---|---|---|---|---|---|
Qwen2-VL-7B | 39.96 | 25.95 | 39.99 | 31.49 | 40.29 | 16.64 | 28.59 | 43.61 | 34.47 |
Qwen2.5-VL-7B | 38.84 | 27.67 | 41.24 | 28.93 | 50.23 | 16.32 | 36.75 | 41.64 | 35.07 |
InternVL-3-8B | 36.64 | 32.75 | 42.17 | 35.11 | 48.92 | 14.35 | 36.51 | 53.94 | 36.48 |
Gemma3-12B | 41.11 | 29.10 | 37.38 | 30.27 | 46.56 | 16.10 | 36.83 | 50.40 | 35.04 |
Kimi-VL-A3B | 37.63 | 27.07 | 39.50 | 22.30 | 40.99 | 22.17 | 33.94 | 46.65 | 34.40 |
MM-Eureka-7B | 40.12 | 31.59 | 39.71 | 28.75 | 49.32 | 16.64 | 37.25 | 46.39 | 35.96 |
VL-Rethinker-7B | 40.65 | 30.08 | 42.02 | 29.87 | 52.03 | 17.83 | 36.82 | 46.90 | 37.25 |
Kimi-VL-A3B-Thinking | 33.45 | 17.76 | 28.11 | 14.69 | 41.14 | 12.64 | 28.60 | 43.97 | 27.08 |
Orsta-7B (Ours) | 41.65 | 31.48 | 43.84 | 32.82 | 54.07 | 17.83 | 36.91 | 41.66 | 38.31 |
Δ (Ours - Backbone) | +2.8 | +3.8 | +2.6 | +3.9 | +3.8 | +1.5 | +0.2 | +0.0 | +3.2 |

The Δ row is computed against Qwen2.5-VL-7B, the backbone of Orsta-7B.
How to Use
Orsta-7B is developed by post-training the Qwen2.5-VL-7B-Instruct model using our V-Triune reinforcement learning system. Consequently, its core usage, particularly regarding input formatting and model interaction, largely follows the established patterns of the Qwen2.5-VL series.
For comprehensive details on the base model's capabilities, multi-turn dialogue format, image input encoding specifics, and other functionalities, we recommend referring to the official Qwen2.5-VL documentation.
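As a rough illustration, the snippet below follows the standard Qwen2.5-VL inference pattern with Hugging Face `transformers` and the `qwen-vl-utils` helper package. The model id, image URL, and prompt are placeholders; take the actual repository name from the model page.

```python
# Minimal inference sketch following the Qwen2.5-VL usage pattern.
# Requires: pip install transformers qwen-vl-utils
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "MiniMax-AI/Orsta-7B"  # placeholder -- use the actual Hugging Face repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/chart.png"},  # placeholder image
            {"type": "text", "text": "How many bars in this chart exceed 50?"},
        ],
    }
]

# Build the chat prompt and collect the image inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
generated = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```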
Citation
If you use Orsta-7B or the V-Triune system in your research, please cite our work:
@article{ma2025one,
title={One RL to See Them All: Visual Triple Unified Reinforcement Learning},
author={Ma, Yan and Du, Linge and Shen, Xuyang and Chen, Shaoxiang and Li, Pengfei and Ren, Qibing and Ma, Lizhuang and Dai, Yuchao and Liu, Pengfei and Yan, Junjie},
journal={arXiv preprint arXiv:2505.18129},
year={2025}
}