---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-2B-Instruct
pipeline_tag: image-text-to-text
---

# InfiGUIAgent-2B-Stage1

This repository contains the **Stage 1 model** from the [InfiGUIAgent](https://arxiv.org/pdf/2501.04575) paper. The model is based on `Qwen2-VL-2B-Instruct` and enhanced with Supervised Fine-Tuning (SFT) on extensive GUI task data to improve its fundamental GUI understanding capabilities.

## Quick Start

### Installation

First install the dependencies used by the example below (`accelerate` is needed for `device_map="auto"`; `opencv-python` and `requests` are used to download and annotate the screenshot):

```bash
pip install torch transformers accelerate qwen-vl-utils opencv-python requests
```
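
The model-loading snippet below also requests FlashAttention 2 via `attn_implementation="flash_attention_2"`, which requires the optional `flash-attn` package and a GPU that supports it. If you prefer not to install it, simply remove that argument and Transformers falls back to its default attention implementation:

```bash
# Optional: only needed if you keep attn_implementation="flash_attention_2"
pip install flash-attn
```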

### GUI Element Localization Example

```python
import json

import cv2
import requests
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Reallm-Labs/InfiGUIAgent-2B-Stage1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # optional: requires flash-attn; remove to use default attention
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Reallm-Labs/InfiGUIAgent-2B-Stage1")

# Prepare inputs: the prompt asks for relative coordinates on a 0-1000 grid, origin at the top-left corner
img_url = "https://raw.githubusercontent.com/Reallm-Labs/InfiGUIAgent/main/images/test_img.png"
prompt_template = """Output the relative coordinates of the icon, widget, or text most closely related to "{instruction}" in this screenshot, in the format of \"{{\"x\": x, \"y\": y}}\", where x and y are in the positive directions of horizontal left and vertical down respectively, with the origin at the top left corner, and the range is 0-1000."""

# Download the test screenshot
response = requests.get(img_url)
with open("test_img.png", "wb") as f:
    f.write(response.content)

# Build the chat message
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "test_img.png"},
        {"type": "text", "text": prompt_template.format(instruction="View detailed storage space usage")},
    ],
}]

# Process inputs and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=128)
output_text = processor.batch_decode(
    [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

# Visualize the result: scale the 0-1000 relative coordinates to pixels and mark the point
try:
    coords = json.loads(output_text)
    img = cv2.imread("test_img.png")
    height, width = img.shape[:2]
    x = int(coords["x"] * width / 1000)
    y = int(coords["y"] * height / 1000)

    cv2.circle(img, (x, y), 10, (0, 0, 255), -1)
    cv2.putText(img, f"({coords['x']}, {coords['y']})", (x + 10, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)
    cv2.imwrite("output.png", img)
except (json.JSONDecodeError, KeyError, TypeError) as e:
    print(f"Error: failed to parse coordinates or annotate image: {e}")

print("Predicted coordinates:", output_text)
```
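
If generation succeeds, `output_text` should be a JSON string such as `{"x": 500, "y": 320}` (values here are illustrative), giving coordinates on the 0-1000 relative grid described in the prompt; the visualization step scales them by the actual image width and height and writes the annotated screenshot to `output.png`.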

## Limitations

This is a **Stage 1 model** focused on establishing fundamental GUI understanding capabilities. It may perform suboptimally on:

- Complex reasoning tasks
- Multi-step operations
- Abstract instruction following

For more information, please refer to our [repo](https://github.com/Reallm-Labs/InfiGUIAgent).

## Citation

```bibtex
@article{liu2025infiguiagent,
  title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
  author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
  journal={arXiv preprint arXiv:2501.04575},
  year={2025}
}
```