One RL to See Them All

Model Overview

Orsta-7B is a vision-language model (VLM) built to perform strongly across a wide spectrum of visual reasoning and visual perception tasks. It is the product of post-training with V-Triune, our unified reinforcement learning (RL) system.

The V-Triune system enables VLMs to be jointly optimized on diverse multimodal tasks within a single, cohesive training pipeline. Orsta-7B was trained with V-Triune on a curated set of eight challenging visual tasks, which promotes robust generalization across both task families.

Training with V-Triune

Orsta-7B's advanced abilities stem from its training with the V-Triune system. Key aspects of its training include:

  • Unified RL Framework (V-Triune): V-Triune is a Visual Triple-Unified Reinforcement Learning system featuring three core complementary components:

    • Sample-Level Data Formatting (to unify diverse task inputs)
    • Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers)
    • Source-Level Metric Monitoring (to diagnose problems at the data-source level)
    • It also incorporates an innovative Dynamic IoU reward mechanism, crucial for optimizing visual perception tasks. You can find more details in our paper: V-Triune
  • Diverse Joint Task Optimization: Orsta-7B was jointly optimized on the following eight visual tasks:

    • Visual Reasoning Tasks: Mathematics, Science Question Answering, Chart Understanding, and Puzzle Solving.
    • Visual Perception Tasks: Object Detection, Visual Grounding, Optical Character Recognition (OCR), and Object Counting.

This training deepens Orsta-7B's understanding of visual content and its relation to textual prompts, allowing it to excel at tasks that require intricate reasoning and precise perception.
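To make the verifier-level reward and Dynamic IoU ideas concrete, here is a minimal, self-contained sketch in Python. Everything in it is illustrative: the verifier names, sample fields, and the linear 0.5 β†’ 0.95 threshold schedule are assumptions for exposition, not the exact settings used by V-Triune (see the paper for the actual design).

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred_box, gt_box, step, total_steps):
    """Reward a predicted box only if its IoU clears a threshold that
    tightens as training progresses (hypothetical 0.5 -> 0.95 schedule)."""
    progress = step / total_steps
    threshold = 0.5 + 0.45 * progress  # stricter matching later in training
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0

# Verifier-level dispatch: each task type routes to its own reward function.
# The task names and sample fields here are hypothetical.
VERIFIERS = {
    "detection": lambda s, step, total: dynamic_iou_reward(
        s["pred_box"], s["gt_box"], step, total
    ),
    "math": lambda s, step, total: float(
        s["pred_answer"].strip() == s["gt_answer"].strip()
    ),
}

def compute_reward(sample, step, total_steps):
    """Route a unified sample dict to the verifier for its task."""
    return VERIFIERS[sample["task"]](sample, step, total_steps)

sample = {"task": "detection", "pred_box": [10, 10, 50, 50], "gt_box": [12, 8, 48, 52]}
print(compute_reward(sample, step=100, total_steps=1000))  # 1.0 early in training
```

The key point the sketch captures is that perception rewards are verifiable and scheduled: early in training a loose match earns reward, while later only precise boxes do.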

Performance

Results on the MEGA-Bench Core benchmark:

| Model | Knowledge | Mathematics | Perception | Coding | Info. Ex. | Planning | Science | Metrics | MEGA-Bench Core |
|-------|-----------|-------------|------------|--------|-----------|----------|---------|---------|-----------------|
| QwenVL-2-7B | 39.96 | 25.95 | 39.99 | 31.49 | 40.29 | 16.64 | 28.59 | 43.61 | 34.47 |
| QwenVL-2.5-7B | 38.84 | 27.67 | 41.24 | 28.93 | 50.23 | 16.32 | 36.75 | 41.64 | 35.07 |
| InternVL-3-8B | 36.64 | 32.75 | 42.17 | 35.11 | 48.92 | 14.35 | 36.51 | 53.94 | 36.48 |
| Gemma3-12B | 41.11 | 29.10 | 37.38 | 30.27 | 46.56 | 16.10 | 36.83 | 50.40 | 35.04 |
| Kimi-VL-A3B | 37.63 | 27.07 | 39.50 | 22.30 | 40.99 | 22.17 | 33.94 | 46.65 | 34.40 |
| MM-Eureka-7B πŸ’‘ | 40.12 | 31.59 | 39.71 | 28.75 | 49.32 | 16.64 | 37.25 | 46.39 | 35.96 |
| VL-Rethinker-7B πŸ’‘ | 40.65 | 30.08 | 42.02 | 29.87 | 52.03 | 17.83 | 36.82 | 46.90 | 37.25 |
| Kimi-VL-A3B-Thinking πŸ’‘ | 33.45 | 17.76 | 28.11 | 14.69 | 41.14 | 12.64 | 28.60 | 43.97 | 27.08 |
| Orsta-7B (Ours) πŸ’‘ | 41.65 | 31.48 | 43.84 | 32.82 | 54.07 | 17.83 | 36.91 | 41.66 | 38.31 |
| Ξ” (Ours βˆ’ Backbone: QwenVL-2.5-7B) | +2.8 | +3.8 | +2.6 | +3.9 | +3.8 | +1.5 | +0.2 | +0.0 | +3.2 |

How to Use

Orsta-7B was developed by post-training the Qwen2.5-VL-7B-Instruct model with our V-Triune reinforcement learning system. Its core usage, particularly input formatting and model interaction, therefore follows the established patterns of the Qwen2.5-VL series.

For comprehensive details on the base model's capabilities, multi-turn dialogue format, image input encoding specifics, and other functionalities, we recommend referring to the official Qwen2.5-VL documentation.
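Since Orsta-7B shares the Qwen2.5-VL architecture, the standard Qwen2.5-VL inference pattern with Hugging Face transformers should apply. The snippet below is a minimal sketch under that assumption; the image URL and prompt are placeholders, and qwen_vl_utils is the helper package published for the Qwen2.5-VL series.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

# Load Orsta-7B with the Qwen2.5-VL architecture class.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "One-RL-to-See-Them-All/Orsta-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("One-RL-to-See-Them-All/Orsta-7B")

# Standard Qwen2.5-VL chat format: interleaved image and text content.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/chart.png"},  # placeholder
            {"type": "text", "text": "How many objects are in this image?"},
        ],
    }
]

# Build the prompt, pack pixel inputs, then generate.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```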

Citation πŸ†

If you use Orsta-7B or the V-Triune system in your research, please cite our work:

@article{ma2025one,
      title={One RL to See Them All: Visual Triple Unified Reinforcement Learning}, 
      author={Ma, Yan and Du, Linge and Shen, Xuyang and Chen, Shaoxiang and Li, Pengfei and Ren, Qibing and Ma, Lizhuang and Dai, Yuchao and Liu, Pengfei and Yan, Junjie},
      journal={arXiv preprint arXiv:2505.18129},
      year={2025}
}