---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

### UI-Venus
This repository contains the UI-Venus model from the report [UI-Venus Technical Report: Building High-performance UI Agents with RFT](https://arxiv.org/abs/2508.10833).

UI-Venus is a native UI agent built on the Qwen2.5-VL multimodal large language model. It performs precise GUI element grounding and effective navigation using only screenshots as input, and achieves state-of-the-art performance through Reinforcement Fine-Tuning (RFT) on high-quality training data. Inference details and usage guides are available in the GitHub repository. We will continue to update results on standard benchmarks, including ScreenSpot-v2/Pro and AndroidWorld.



[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Report](https://img.shields.io/badge/Report-Technical%20Report-blueviolet?logo=notion)](http://arxiv.org/abs/2508.10833)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-green?logo=github)](https://github.com/inclusionAI/UI-Venus)
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-orange?logo=huggingface)](https://huggingface.co/inclusionAI/UI-Venus-Navi-7B)

---

 <p align="center">
  πŸ“ˆ UI-Venus Benchmark Performance
</p>

<p align="center">
  <img src="performance_venus.png" alt="UI-Venus Performance Across Datasets" width="1200" />
  <br>
</p>

> **Figure:** Performance of UI-Venus across multiple benchmark datasets. UI-Venus achieves **State-of-the-Art (SOTA)** results on key UI understanding and interaction benchmarks, including **ScreenSpot-Pro**, **ScreenSpot-v2**, **OS-World-G**, **UI-Vision**, and **Android World**. The results demonstrate its superior capability in visual grounding, UI navigation, cross-platform generalization, and complex task reasoning.

### Model Description

UI-Venus is a multimodal UI agent built on Qwen2.5-VL that performs accurate UI grounding and navigation using only screenshots as input. The 7B and 72B variants achieve 94.1%/50.8% and 95.3%/61.9% on the ScreenSpot-v2 and ScreenSpot-Pro benchmarks, surpassing prior SOTA models such as GTA1 and UI-TARS-1.5. On the AndroidWorld navigation benchmark, they achieve 49.1% and 65.9% success rates, respectively, demonstrating strong planning and generalization capabilities.

Key innovations include:
- **SOTA Open-Source UI Agent**: Publicly released to advance research in autonomous UI interaction and agent-based systems.
- **Reinforcement Fine-Tuning (RFT)**: Utilizes carefully designed reward functions for both grounding and navigation tasks.
- **Efficient Data Cleaning**: Trained on several hundred thousand high-quality samples to ensure robustness.
- **Self-Evolving Trajectory History Alignment & Sparse Action Enhancement**: Improves reasoning coherence and action distribution for better long-horizon planning.

---
## Installation

First, install the required dependencies:

```bash
pip install transformers==4.49.0 qwen-vl-utils
```
---
  
## Quick Start
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from typing import Dict, Tuple, Any
import torch
import os
import re
from qwen_vl_utils import process_vision_info

# -----------------------------
# Model & Tokenizer
# -----------------------------
MODEL_NAME = "inclusionAI/UI-Venus-Navi-72B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
).eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_NAME)

GENERATION_CONFIG = {
    "max_new_tokens": 2048,
    "do_sample": False,
    "temperature": 0.0,
}

# -----------------------------
# Prompt Template
# -----------------------------
PROMPT_TEMPLATE = """**You are a GUI Agent.**
Your task is to analyze a given user task, review current screenshot and previous actions, and determine the next action to complete the task.

### User Task
{user_task}

### Previous Actions
{previous_actions}

### Available Actions
Click(box=(x1, y1))
Drag(start=(x1, y1), end=(x2, y2))
Scroll(start=(x1, y1), end=(x2, y2), direction='down/up/right/left')
Type(content='')
Launch(app='')
Wait()
Finished(content='')
CallUser(content='')
LongPress(box=(x1, y1))
PressBack()
PressHome()
PressEnter()
PressRecent()

### Instruction
- Make sure you understand the task goal to avoid wrong actions.
- Examine the screenshot carefully. History may be unreliable.
- For user questions, reply with `CallUser`, then `Finished` if done.
- Explore screen content using scroll in different directions.
- Copy text: select β†’ click `copy`.
- Paste text: long press text box β†’ click `paste`.
- First reason inside <think>, then provide <action>, then summarize in <conclusion>.
"""

# -----------------------------
# Parse action
# -----------------------------
def parse_action(action_str: str) -> Tuple[str, Dict[str, Any]]:
    """Parse action string into action type + params."""
    pattern = r"^(\w+)\((.*)\)$"
    match = re.match(pattern, action_str.strip(), re.DOTALL)
    if not match:
        print(f"Invalid action type: {action_str}")
        return "", {}

    action_type, params_str = match.group(1), match.group(2).strip()
    params = {}

    if params_str:
        try:
            # split by comma not inside parentheses
            param_pairs = re.split(r",(?![^(]*\))", params_str)
            for pair in param_pairs:
                if "=" in pair:
                    key, value = pair.split("=", 1)
                    params[key.strip()] = value.strip().strip("'").strip()
                else:
                    params[pair.strip()] = None
        except Exception as e:
            print(f"Parse param failed: {e}")
            return action_type, {}
    return action_type, params


def extract_tag(content: str, tag: str) -> str:
    """Extract latest <tag>...</tag> content from model output."""
    pattern = fr"<{tag}>(.*?)</{tag}>"
    matches = list(re.finditer(pattern, content, re.DOTALL))
    if not matches:
        print(f"{tag} Not Found")
        return ""
    return matches[-1].group(1).strip()

# -----------------------------
# Inference
# -----------------------------
def inference(image_path: str, goal: str) -> Dict[str, str]:
    if not (os.path.exists(image_path) and os.path.isfile(image_path)):
        raise FileNotFoundError(f"Invalid input image path: {image_path}")

    full_prompt = PROMPT_TEMPLATE.format(user_task=goal, previous_actions="")

    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": full_prompt},
            {"type": "image", "image": image_path, "min_pixels": 3136, "max_pixels": 12845056},
        ],
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)

    model_inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt"
    ).to(model.device)

    generated_ids = model.generate(**model_inputs, **GENERATION_CONFIG)
    generated_ids_trimmed = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
    output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0]

    return {
        "raw_response": output_text,
        "think": extract_tag(output_text, "think"),
        "action": extract_tag(output_text, "action"),
        "conclusion": extract_tag(output_text, "conclusion"),
    }
```
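
A minimal example of calling the snippet above; the screenshot path and task string below are placeholders:

```python
# Illustrative one-step invocation of the agent defined above.
result = inference("screenshot.png", "Open the settings app and enable dark mode")
print("Reasoning:", result["think"])
print("Action:", result["action"])

# Parse the predicted action into a type and parameter dict, e.g.
# "Click(box=(540, 1220))" -> ("Click", {"box": "(540, 1220)"}).
action_type, params = parse_action(result["action"])
print(action_type, params)
```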

### Usage
⚠️ For action types that include coordinates (e.g., `Click`, `Scroll`, `Drag`),  
the code above does **not** handle coordinate conversion.  
You need to map the predicted coordinates back to the original image space, using the same `min_pixels`/`max_pixels` resizing applied by the processor, before executing them.
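
A sketch of one way to do this mapping is shown below. It assumes the model emits pixel coordinates in the smart-resized image space used by the Qwen2.5-VL processor (dimensions rounded to multiples of 28 and clamped to the `min_pixels`/`max_pixels` budget). The helpers `smart_resize_dims` and `map_to_original` are illustrative, not part of this repository, so verify them against the resizing actually performed by `qwen_vl_utils`:

```python
import math
from PIL import Image

def smart_resize_dims(height: int, width: int, factor: int = 28,
                      min_pixels: int = 3136, max_pixels: int = 12845056):
    """Approximate the Qwen2.5-VL smart-resize rule to recover the resized dimensions."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

def map_to_original(x: float, y: float, image_path: str,
                    min_pixels: int = 3136, max_pixels: int = 12845056):
    """Map a coordinate predicted on the resized image back to the original screenshot."""
    width, height = Image.open(image_path).size
    resized_h, resized_w = smart_resize_dims(height, width,
                                             min_pixels=min_pixels, max_pixels=max_pixels)
    return x * width / resized_w, y * height / resized_h
```

For example, a predicted `Click(box=(x, y))` would then be executed at `map_to_original(x, y, image_path)` on the device screen.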

---
### Results on AndroidWorld
A compressed package of validation trajectories for **AndroidWorld**, including execution logs and navigation paths, is available for download.  
πŸ“₯ Download: [UI-Venus-androidworld.zip](https://github.com/inclusionAI/UI-Venus)

| Models | With Planner | A11y Tree | Screenshot | Success Rate (pass@1) |
|--------|--------------|-----------|------------|------------------------|
| **Closed-source Models** | | | | |
| GPT-4o| ❌ | βœ… | ❌ | 30.6 |
| ScaleTrack| ❌ | βœ… | ❌ | 44.0 |
| SeedVL-1.5 | ❌ | βœ… | βœ… | 62.1 |
| UI-TARS-1.5 | ❌ | ❌ | βœ… | 64.2 |
| **Open-source Models** | | | | |
| GUI-Critic-R1-7B | ❌ | βœ… | βœ… | 27.6 |
| Qwen2.5-VL-72B* | ❌ | ❌ | βœ… | 35.0 |
| UGround | βœ… | ❌ | βœ… | 44.0 |
| Aria-UI | βœ… | ❌ | βœ… | 44.8 |
| UI-TARS-72B | ❌ | ❌ | βœ… | 46.6 |
| GLM-4.5v | ❌ | ❌ | βœ… | 57.0 |
| **Ours** | | | | |
| UI-Venus-Navi-7B | ❌ | ❌ | βœ… | **49.1** |
| UI-Venus-Navi-72B | ❌ | ❌ | βœ… | **65.9** |

> **Table:** Performance comparison on **AndroidWorld** for end-to-end models. Models marked with * are reproduced results. Our UI-Venus-Navi-72B achieves state-of-the-art performance, outperforming all baseline methods across different settings.

### Results on AndroidControl and GUI-Odyssey

| Models | AndroidControl-Low<br>Type Acc. | AndroidControl-Low<br>Step SR | AndroidControl-High<br>Type Acc. | AndroidControl-High<br>Step SR | GUI-Odyssey<br>Type Acc. | GUI-Odyssey<br>Step SR |
|--------|-------------------------------|-----------------------------|-------------------------------|-----------------------------|------------------------|----------------------|
| **Closed-source Models** | | | | | | |
| GPT-4o | 74.3 | 19.4 | 66.3 | 20.8 | 34.3 | 3.3 |
| **Open Source Models** | | | | | | |
| Qwen2.5-VL-7B | 94.1 | 85.0 | 75.1 | 62.9 | 59.5 | 46.3 |
| SeeClick | 93.0 | 75.0 | 82.9 | 59.1 | 71.0 | 53.9 |
| OS-Atlas-7B | 93.6 | 85.2 | 85.2 | 71.2 | 84.5 | 62.0 |
| Aguvis-7B| - | 80.5 | - | 61.5 | - | - |
| Aguvis-72B| - | 84.4 | - | 66.4 | - | - |
| OS-Genesis-7B | 90.7 | 74.2 | 66.2 | 44.5 | - | - |
| UI-TARS-7B| 98.0 | 90.8 | 83.7 | 72.5 | 94.6 | 87.0 |
| UI-TARS-72B| **98.1** | 91.3 | 85.2 | 74.7 | **95.4** | **88.6** |
| GUI-R1-7B| 85.2 | 66.5 | 71.6 | 51.7 | 65.5 | 38.8 |
| NaviMaster-7B | 85.6 | 69.9 | 72.9 | 54.0 | - | - |
| UI-AGILE-7B | 87.7 | 77.6 | 80.1 | 60.6 | - | - |
| AgentCPM-GUI | 94.4 | 90.2 | 77.7 | 69.2 | 90.0 | 75.0 |
| **Ours** | | | | | | |
| UI-Venus-Navi-7B | 97.1 | 92.4 | **86.5** | 76.1 | 87.3 | 71.5 |
| UI-Venus-Navi-72B | 96.7 | **92.9** | 85.9 | **77.2** | 87.2 | 72.4 |

> **Table:** Performance comparison on offline UI navigation datasets including AndroidControl and GUI-Odyssey. Note that models with * are reproduced.

## Citation
Please consider citing if you find our work useful:
```bibtex
@misc{gu2025uivenustechnicalreportbuilding,
      title={UI-Venus Technical Report: Building High-performance UI Agents with RFT}, 
      author={Zhangxuan Gu and Zhengwen Zeng and Zhenyu Xu and Xingran Zhou and Shuheng Shen and Yunfei Liu and Beitong Zhou and Changhua Meng and Tianyu Xia and Weizhi Chen and Yue Wen and Jingya Dou and Fei Tang and Jinzhen Lin and Yulin Liu and Zhenlin Guo and Yichen Gong and Heng Jia and Changlong Gao and Yuan Guo and Yong Deng and Zhenyu Guo and Liang Chen and Weiqiang Wang},
      year={2025},
      eprint={2508.10833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.10833}, 
}
```