---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

### UI-Venus
This repository contains the UI-Venus model from the report [UI-Venus Technical Report: Building High-performance UI Agents with RFT](https://arxiv.org/abs/2508.10833).

UI-Venus is a native UI agent built on the Qwen2.5-VL multimodal large language model. It performs precise GUI element grounding and effective navigation using only screenshots as input, and achieves state-of-the-art performance through Reinforcement Fine-Tuning (RFT) on high-quality training data. Inference details and usage guides are available in the GitHub repository. We will continue to update results on standard benchmarks, including ScreenSpot-v2/Pro and AndroidWorld.



[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Report](https://img.shields.io/badge/Report-Technical%20Report-blueviolet?logo=notion)](http://arxiv.org/abs/2508.10833)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-green?logo=github)](https://github.com/inclusionAI/UI-Venus)
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-orange?logo=huggingface)](https://huggingface.co/inclusionAI/UI-Venus-Navi-7B)

---

 <p align="center">
  πŸ“ˆ UI-Venus Benchmark Performance
</p>

<p align="center">
  <img src="performance_venus.png" alt="UI-Venus Performance Across Datasets" width="1200" />
  <br>
</p>

> **Figure:** Performance of UI-Venus across multiple benchmark datasets. UI-Venus achieves **State-of-the-Art (SOTA)** results on key UI understanding and interaction benchmarks, including **ScreenSpot-Pro**, **ScreenSpot-v2**, **OS-World-G**, **UI-Vision**, and **Android World**. The results demonstrate its superior capability in visual grounding, UI navigation, cross-platform generalization, and complex task reasoning.

### Model Description

UI-Venus is a multimodal UI agent built on Qwen2.5-VL that performs accurate UI grounding and navigation using only screenshots as input. The 7B and 72B variants achieve 94.1%/50.8% and 95.3%/61.9% on the ScreenSpot-v2 and ScreenSpot-Pro benchmarks, surpassing prior SOTA models such as GTA1 and UI-TARS-1.5. On the AndroidWorld navigation benchmark, they achieve 49.1% and 65.9% success rates, respectively, demonstrating strong planning and generalization capabilities.

Key innovations include:
- **SOTA Open-Source UI Agent**: Publicly released to advance research in autonomous UI interaction and agent-based systems.
- **Reinforcement Fine-Tuning (RFT)**: Utilizes carefully designed reward functions for both grounding and navigation tasks.
- **Efficient Data Cleaning**: Trained on several hundred thousand high-quality samples to ensure robustness.
- **Self-Evolving Trajectory History Alignment & Sparse Action Enhancement**: Improves reasoning coherence and action distribution for better long-horizon planning.

---
## Installation

First, install the required dependencies:

```bash
pip install transformers==4.49.0 qwen-vl-utils
```
---
  
## Quick Start
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from typing import Dict, Tuple, Any
import torch
import os
import re
from qwen_vl_utils import process_vision_info

# -----------------------------
# Model & Tokenizer
# -----------------------------
MODEL_NAME = "inclusionAI/UI-Venus-Navi-72B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
).eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_NAME)

GENERATION_CONFIG = {
    "max_new_tokens": 2048,
    "do_sample": False,
    "temperature": 0.0,
}

# -----------------------------
# Prompt Template
# -----------------------------
PROMPT_TEMPLATE = """**You are a GUI Agent.**
Your task is to analyze a given user task, review current screenshot and previous actions, and determine the next action to complete the task.

### User Task
{user_task}

### Previous Actions
{previous_actions}

### Available Actions
Click(box=(x1, y1))
Drag(start=(x1, y1), end=(x2, y2))
Scroll(start=(x1, y1), end=(x2, y2), direction='down/up/right/left')
Type(content='')
Launch(app='')
Wait()
Finished(content='')
CallUser(content='')
LongPress(box=(x1, y1))
PressBack()
PressHome()
PressEnter()
PressRecent()

### Instruction
- Make sure you understand the task goal to avoid wrong actions.
- Examine the screenshot carefully. History may be unreliable.
- For user questions, reply with `CallUser`, then `Finished` if done.
- Explore screen content using scroll in different directions.
- Copy text: select β†’ click `copy`.
- Paste text: long press text box β†’ click `paste`.
- First reason inside <think>, then provide <action>, then summarize in <conclusion>.
"""

# -----------------------------
# Parse action
# -----------------------------
def parse_action(action_str: str) -> Tuple[str, Dict[str, Any]]:
    """Parse action string into action type + params."""
    pattern = r"^(\w+)\((.*)\)$"
    match = re.match(pattern, action_str.strip(), re.DOTALL)
    if not match:
        print(f"Invalid action type: {action_str}")
        return "", {}

    action_type, params_str = match.group(1), match.group(2).strip()
    params = {}

    if params_str:
        try:
            # split by comma not inside parentheses
            param_pairs = re.split(r",(?![^(]*\))", params_str)
            for pair in param_pairs:
                if "=" in pair:
                    key, value = pair.split("=", 1)
                    params[key.strip()] = value.strip().strip("'").strip()
                else:
                    params[pair.strip()] = None
        except Exception as e:
            print(f"Parse param failed: {e}")
            return action_type, {}
    return action_type, params


def extract_tag(content: str, tag: str) -> str:
    """Extract latest <tag>...</tag> content from model output."""
    pattern = fr"<{tag}>(.*?)</{tag}>"
    matches = list(re.finditer(pattern, content, re.DOTALL))
    if not matches:
        print(f"{tag} Not Found")
        return ""
    return matches[-1].group(1).strip()

# -----------------------------
# Inference
# -----------------------------
def inference(image_path: str, goal: str) -> Dict[str, str]:
    if not (os.path.exists(image_path) and os.path.isfile(image_path)):
        raise FileNotFoundError(f"Invalid input image path: {image_path}")

    full_prompt = PROMPT_TEMPLATE.format(user_task=goal, previous_actions="")

    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": full_prompt},
            {"type": "image", "image": image_path, "min_pixels": 3136, "max_pixels": 12845056},
        ],
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)

    model_inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt"
    ).to(model.device)

    generated_ids = model.generate(**model_inputs, **GENERATION_CONFIG)
    generated_ids_trimmed = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
    output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0]

    return {
        "raw_response": output_text,
        "think": extract_tag(output_text, "think"),
        "action": extract_tag(output_text, "action"),
        "conclusion": extract_tag(output_text, "conclusion"),
    }
```
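
A minimal example of calling the snippet above; the screenshot path and task string below are placeholders:

```python
# Illustrative one-step invocation of the agent defined above.
result = inference("screenshot.png", "Open the settings app and enable dark mode")
print("Reasoning:", result["think"])
print("Action:", result["action"])

# Parse the predicted action into a type and parameter dict, e.g.
# "Click(box=(540, 1220))" -> ("Click", {"box": "(540, 1220)"}).
action_type, params = parse_action(result["action"])
print(action_type, params)
```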

### Usage
⚠️ For action types that include coordinates (e.g., `Click`, `Scroll`, `Drag`),  
the code above does **not** handle coordinate conversion.  
You need to map the predicted coordinates back to the original image space, using the same `min_pixels`/`max_pixels` resizing applied by the processor, before executing them.
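
A sketch of one way to do this mapping is shown below. It assumes the model emits pixel coordinates in the smart-resized image space used by the Qwen2.5-VL processor (dimensions rounded to multiples of 28 and clamped to the `min_pixels`/`max_pixels` budget). The helpers `smart_resize_dims` and `map_to_original` are illustrative, not part of this repository, so verify them against the resizing actually performed by `qwen_vl_utils`:

```python
import math
from PIL import Image

def smart_resize_dims(height: int, width: int, factor: int = 28,
                      min_pixels: int = 3136, max_pixels: int = 12845056):
    """Approximate the Qwen2.5-VL smart-resize rule to recover the resized dimensions."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

def map_to_original(x: float, y: float, image_path: str,
                    min_pixels: int = 3136, max_pixels: int = 12845056):
    """Map a coordinate predicted on the resized image back to the original screenshot."""
    width, height = Image.open(image_path).size
    resized_h, resized_w = smart_resize_dims(height, width,
                                             min_pixels=min_pixels, max_pixels=max_pixels)
    return x * width / resized_w, y * height / resized_h
```

For example, a predicted `Click(box=(x, y))` would then be executed at `map_to_original(x, y, image_path)` on the device screen.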

---
### Results on AndroidWorld
A compressed package of validation trajectories for **AndroidWorld**, including execution logs and navigation paths, is available for download.  
πŸ“₯ Download: [UI-Venus-androidworld.zip](https://github.com/inclusionAI/UI-Venus)

| Models | With Planner | A11y Tree | Screenshot | Success Rate (pass@1) |
|--------|--------------|-----------|------------|------------------------|
| **Closed-source Models** | | | | |
| GPT-4o| ❌ | βœ… | ❌ | 30.6 |
| ScaleTrack| ❌ | βœ… | ❌ | 44.0 |
| SeedVL-1.5 | ❌ | βœ… | βœ… | 62.1 |
| UI-TARS-1.5 | ❌ | ❌ | βœ… | 64.2 |
| **Open-source Models** | | | | |
| GUI-Critic-R1-7B | ❌ | βœ… | βœ… | 27.6 |
| Qwen2.5-VL-72B* | ❌ | ❌ | βœ… | 35.0 |
| UGround | βœ… | ❌ | βœ… | 44.0 |
| Aria-UI | βœ… | ❌ | βœ… | 44.8 |
| UI-TARS-72B | ❌ | ❌ | βœ… | 46.6 |
| GLM-4.5v | ❌ | ❌ | βœ… | 57.0 |
| **Ours** | | | | |
| UI-Venus-Navi-7B | ❌ | ❌ | βœ… | **49.1** |
| UI-Venus-Navi-72B | ❌ | ❌ | βœ… | **65.9** |

> **Table:** Performance comparison on **AndroidWorld** for end-to-end models. Models marked with * are reproduced results. Our UI-Venus-Navi-72B achieves state-of-the-art performance, outperforming all baseline methods across different settings.

### Results on AndroidControl and GUI-Odyssey

| Models | AndroidControl-Low<br>Type Acc. | AndroidControl-Low<br>Step SR | AndroidControl-High<br>Type Acc. | AndroidControl-High<br>Step SR | GUI-Odyssey<br>Type Acc. | GUI-Odyssey<br>Step SR |
|--------|-------------------------------|-----------------------------|-------------------------------|-----------------------------|------------------------|----------------------|
| **Closed-source Models** | | | | | | |
| GPT-4o | 74.3 | 19.4 | 66.3 | 20.8 | 34.3 | 3.3 |
| **Open Source Models** | | | | | | |
| Qwen2.5-VL-7B | 94.1 | 85.0 | 75.1 | 62.9 | 59.5 | 46.3 |
| SeeClick | 93.0 | 75.0 | 82.9 | 59.1 | 71.0 | 53.9 |
| OS-Atlas-7B | 93.6 | 85.2 | 85.2 | 71.2 | 84.5 | 62.0 |
| Aguvis-7B| - | 80.5 | - | 61.5 | - | - |
| Aguvis-72B| - | 84.4 | - | 66.4 | - | - |
| OS-Genesis-7B | 90.7 | 74.2 | 66.2 | 44.5 | - | - |
| UI-TARS-7B| 98.0 | 90.8 | 83.7 | 72.5 | 94.6 | 87.0 |
| UI-TARS-72B| **98.1** | 91.3 | 85.2 | 74.7 | **95.4** | **88.6** |
| GUI-R1-7B| 85.2 | 66.5 | 71.6 | 51.7 | 65.5 | 38.8 |
| NaviMaster-7B | 85.6 | 69.9 | 72.9 | 54.0 | - | - |
| UI-AGILE-7B | 87.7 | 77.6 | 80.1 | 60.6 | - | - |
| AgentCPM-GUI | 94.4 | 90.2 | 77.7 | 69.2 | 90.0 | 75.0 |
| **Ours** | | | | | | |
| UI-Venus-Navi-7B | 97.1 | 92.4 | **86.5** | 76.1 | 87.3 | 71.5 |
| UI-Venus-Navi-72B | 96.7 | **92.9** | 85.9 | **77.2** | 87.2 | 72.4 |

> **Table:** Performance comparison on offline UI navigation datasets including AndroidControl and GUI-Odyssey. Note that models with * are reproduced.

## Citation
Please consider citing if you find our work useful:
```bibtex
@misc{gu2025uivenustechnicalreportbuilding,
      title={UI-Venus Technical Report: Building High-performance UI Agents with RFT}, 
      author={Zhangxuan Gu and Zhengwen Zeng and Zhenyu Xu and Xingran Zhou and Shuheng Shen and Yunfei Liu and Beitong Zhou and Changhua Meng and Tianyu Xia and Weizhi Chen and Yue Wen and Jingya Dou and Fei Tang and Jinzhen Lin and Yulin Liu and Zhenlin Guo and Yichen Gong and Heng Jia and Changlong Gao and Yuan Guo and Yong Deng and Zhenyu Guo and Liang Chen and Weiqiang Wang},
      year={2025},
      eprint={2508.10833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.10833}, 
}
```