---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-2B-Instruct
pipeline_tag: image-text-to-text
---

# InfiGUIAgent-2B-Stage1

This repository contains the **Stage 1 model** from the [InfiGUIAgent](https://arxiv.org/pdf/2501.04575) paper. The model is based on `Qwen2-VL-2B-Instruct` and enhanced with Supervised Fine-Tuning (SFT) on extensive GUI task data to improve its fundamental GUI understanding capabilities.

## Quick Start

### Installation

First install the dependencies used by the example below (`accelerate` is needed for `device_map="auto"`; `opencv-python` and `requests` are used to download and annotate the screenshot):

```bash
pip install torch transformers accelerate qwen-vl-utils opencv-python requests
```
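
The model-loading snippet below also requests FlashAttention 2 via `attn_implementation="flash_attention_2"`, which requires the optional `flash-attn` package and a GPU that supports it. If you prefer not to install it, simply remove that argument and Transformers falls back to its default attention implementation:

```bash
# Optional: only needed if you keep attn_implementation="flash_attention_2"
pip install flash-attn
```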

### GUI Element Localization Example

```python
import json

import cv2
import requests
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Reallm-Labs/InfiGUIAgent-2B-Stage1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # optional: requires flash-attn; remove to use default attention
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Reallm-Labs/InfiGUIAgent-2B-Stage1")

# Prepare inputs: the prompt asks for relative coordinates on a 0-1000 grid, origin at the top-left corner
img_url = "https://raw.githubusercontent.com/Reallm-Labs/InfiGUIAgent/main/images/test_img.png"
prompt_template = """Output the relative coordinates of the icon, widget, or text most closely related to "{instruction}" in this screenshot, in the format of \"{{\"x\": x, \"y\": y}}\", where x and y are in the positive directions of horizontal left and vertical down respectively, with the origin at the top left corner, and the range is 0-1000."""

# Download the test screenshot
response = requests.get(img_url)
with open("test_img.png", "wb") as f:
    f.write(response.content)

# Build the chat message
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "test_img.png"},
        {"type": "text", "text": prompt_template.format(instruction="View detailed storage space usage")},
    ],
}]

# Process inputs and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=128)
output_text = processor.batch_decode(
    [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

# Visualize the result: scale the 0-1000 relative coordinates to pixels and mark the point
try:
    coords = json.loads(output_text)
    img = cv2.imread("test_img.png")
    height, width = img.shape[:2]
    x = int(coords["x"] * width / 1000)
    y = int(coords["y"] * height / 1000)

    cv2.circle(img, (x, y), 10, (0, 0, 255), -1)
    cv2.putText(img, f"({coords['x']}, {coords['y']})", (x + 10, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)
    cv2.imwrite("output.png", img)
except (json.JSONDecodeError, KeyError, TypeError) as e:
    print(f"Error: failed to parse coordinates or annotate image: {e}")

print("Predicted coordinates:", output_text)
```
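
If generation succeeds, `output_text` should be a JSON string such as `{"x": 500, "y": 320}` (values here are illustrative), giving coordinates on the 0-1000 relative grid described in the prompt; the visualization step scales them by the actual image width and height and writes the annotated screenshot to `output.png`.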

## Limitations

This is a **Stage 1 model** focused on establishing fundamental GUI understanding capabilities. It may perform suboptimally on:

- Complex reasoning tasks
- Multi-step operations
- Abstract instruction following

For more information, please refer to our [repo](https://github.com/Reallm-Labs/InfiGUIAgent).

## Citation

```bibtex
@article{liu2025infiguiagent,
  title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
  author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
  journal={arXiv preprint arXiv:2501.04575},
  year={2025}
}
```