arxiv:2507.22025

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

Published on Jul 29 · Submitted by LianShuQuan on Aug 11
Authors: Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li

Abstract

AI-generated summary: UI-AGILE enhances GUI agents through improved training with a Continuous Reward function, Simple Thinking reward, and Cropping-based Resampling, and inference with Decomposed Grounding with Selection, achieving state-of-the-art performance on GUI benchmarks.

The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a dilemma in reasoning design, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE, a comprehensive framework that enhances GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the reinforcement fine-tuning (RFT) process: 1) a Continuous Reward function to incentivize high-precision grounding; 2) a "Simple Thinking" reward to balance planning with speed and grounding accuracy; and 3) a Cropping-based Resampling strategy to mitigate the sparse-reward problem and improve learning on complex tasks. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art performance on two benchmarks, ScreenSpot-Pro and ScreenSpot-v2. For instance, combining our proposed training and inference enhancements yields a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro.

Community



🔥 Overview

UI-AGILE enhances GUI agents through improved training with a Continuous Reward function, Simple Thinking reward, and Cropping-based Resampling, and inference with Decomposed Grounding with Selection.

[Figure: UI-AGILE framework overview (UI-Crop-V1.drawio.png)]
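For intuition about the reward design named above, here is a minimal sketch of what a Continuous Reward and a "Simple Thinking" reward might look like. The exact formulations (distance-based exponential decay, a token-length budget) are illustrative assumptions, not the paper's definitions.

```python
import math

def continuous_grounding_reward(pred_xy, gt_box):
    """Assumed form of a continuous grounding reward: 1.0 at the center
    of the ground-truth box, decaying smoothly with distance instead of
    a binary hit/miss, so near-misses still provide a learning signal."""
    x1, y1, x2, y2 = gt_box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Normalize by the box half-diagonal so the reward is comparable
    # across small and large UI elements.
    half_diag = math.hypot(x2 - x1, y2 - y1) / 2
    dist = math.hypot(pred_xy[0] - cx, pred_xy[1] - cy)
    return math.exp(-dist / max(half_diag, 1e-6))

def simple_thinking_reward(num_thinking_tokens, budget=128):
    """Assumed form of a "Simple Thinking" reward: full reward for
    reasoning that stays within a token budget, with a linear penalty
    for overly long chains of thought."""
    if num_thinking_tokens <= budget:
        return 1.0
    return max(0.0, 1.0 - (num_thinking_tokens - budget) / budget)
```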

Trained on only about 9k samples for just 2 epochs, UI-AGILE shows superior performance while also demonstrating strong general agent capabilities. Furthermore, our inference method can act as a plug-and-play enhancement for a wide range of existing agents, improving the grounding accuracy of several existing open-source models.

As a baseline, the standard grounding approach applied to UI-AGILE-7B completes the benchmark in 30 minutes. With our method, the decomposed grounding stage takes 35 minutes, and the subsequent VLM-based selection stage requires an additional 4 minutes. This modest increase in overhead is a practical trade-off for the substantial gain in grounding accuracy brought by our method.
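The pipeline below is a minimal sketch of this two-stage idea, assuming hypothetical `ground` and `select_best` callables in place of the grounding model and the VLM-based selector; the actual implementation is in eval/grounding/eval_grounding_vllm_no_ray.py.

```python
from PIL import Image

def decomposed_grounding(image: Image.Image, instruction: str,
                         ground, select_best, grid=(2, 2)):
    """Sketch of Decomposed Grounding with Selection: split a
    high-resolution screenshot into sub-images, ground the instruction
    in each part, then let a VLM pick the best candidate.

    `ground` and `select_best` are hypothetical callables standing in
    for the grounding model and the VLM-based selection stage."""
    w, h = image.size
    cols, rows = grid
    tile_w, tile_h = w // cols, h // rows
    candidates = []
    for r in range(rows):
        for c in range(cols):
            left, top = c * tile_w, r * tile_h
            crop = image.crop((left, top, left + tile_w, top + tile_h))
            x, y = ground(crop, instruction)                # local coordinates
            candidates.append((crop, (x + left, y + top)))  # map back to global
    # Selection stage: the VLM picks the candidate whose sub-image best
    # matches the instruction, and its global coordinates are returned.
    return select_best(candidates, instruction)
```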

[Figure: Results on ScreenSpot-Pro (ScreenSpot-Pro.png)]

"Attempt Num Distribution" shows the distribution of attempts per GRPO training step, where each step processes a batch of two training samples. In the first epoch, we find that only 61.8% of training steps are fully successful on the initial attempt (i.e., both samples in the batch are solved without resampling). This means that without our strategy, a minimum of 19.1% (38.2% ÷ 2) of training samples would have provided no learning signal. Overall attempt numbers decreases in the second epoch, demonstrating that the model learns from the samples salvaged by our method.

[Figure: Attempt number distribution per GRPO training step (attempt_num_ratio.png)]
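As a rough illustration of how such zero-signal steps can be salvaged, the sketch below assumes a PIL image as input and a hypothetical `rollout` callable that returns one reward per GRPO rollout; when every rollout for a sample fails, the screenshot is cropped around the ground-truth element to shrink the search space and the sample is retried.

```python
def train_sample_with_resampling(image, instruction, gt_box, rollout,
                                 max_attempts=3, crop_scale=0.5):
    """Sketch of Cropping-based Resampling: if all GRPO rollouts for a
    sample earn zero reward (no learning signal), re-crop the screenshot
    around the ground-truth element and retry on the easier view.

    `rollout` is a hypothetical callable returning a list of rewards."""
    for attempt in range(max_attempts):
        rewards = rollout(image, instruction)
        if any(r > 0 for r in rewards):
            return rewards, attempt + 1  # at least one useful rollout
        # All rollouts failed: crop toward the target to densify reward.
        x1, y1, x2, y2 = gt_box
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        w, h = image.size
        new_w, new_h = w * crop_scale, h * crop_scale
        left = min(max(cx - new_w / 2, 0), w - new_w)
        top = min(max(cy - new_h / 2, 0), h - new_h)
        image = image.crop((int(left), int(top),
                            int(left + new_w), int(top + new_h)))
        # Express the ground truth in the new crop's coordinate frame.
        gt_box = (x1 - left, y1 - top, x2 - left, y2 - top)
    return rewards, max_attempts
```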

Setup

We provide the code for our RFT training and the Decomposed Grounding with Selection method in two separate modules. To avoid potential dependency conflicts, each module is designed to be run in its own conda environment.

Inference

cd eval

To accelerate evaluation, we organize the data as Parquet files and provide the corresponding evaluation code.

You can easily adapt your models to our pipeline; a sketch of the adaptation follows the script list below.

eval/grounding/eval_grounding_vllm_no_ray.py is for grounding benchmarks (ScreenSpot-v2 and ScreenSpot-Pro).

eval/android_control/inference_android_control_refactored.py is for AndroidControl.
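To plug a custom model into this pipeline, a loop along the following lines should suffice. The Parquet column names (`image`, `instruction`, `bbox`) and the `my_model_ground` function are hypothetical placeholders; consult the evaluation scripts above for the actual schema.

```python
import io

import pandas as pd
from PIL import Image

def evaluate_grounding(parquet_path, my_model_ground):
    """Hedged sketch of running a custom model on a Parquet-based
    grounding benchmark; the column names are assumptions."""
    df = pd.read_parquet(parquet_path)
    hits = 0
    for row in df.itertuples():
        image = Image.open(io.BytesIO(row.image))       # screenshot bytes
        x, y = my_model_ground(image, row.instruction)  # predicted click point
        x1, y1, x2, y2 = row.bbox                       # ground-truth box
        hits += (x1 <= x <= x2) and (y1 <= y <= y2)     # standard hit test
    return hits / len(df)
```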

Training

cd train/src/scripts
bash train.sh

⭐️ Citation

If you find this project useful, please consider citing us.



@misc{lian2025uiagileadvancingguiagents,
      title={UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding}, 
      author={Shuquan Lian and Yuhang Wu and Jia Ma and Zihan Song and Bingqi Chen and Xiawu Zheng and Hui Li},
      year={2025},
      eprint={2507.22025},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2507.22025}, 
}

🤝 Acknowledgements

We sincerely thank the projects R1-V, Open-R1, Open-r1-multimodal, and VLM-R1 for providing their open-source resources.
