Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
Abstract
GUI-RC and GUI-RCPO enhance GUI grounding accuracy by leveraging spatial consistency and reinforcement learning without additional training data.
Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
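For readers who want the mechanics in concrete terms, below is a minimal Python sketch of the two ideas the abstract describes: a spatial voting grid built from multiple sampled box predictions with a peak-agreement consensus region (the GUI-RC idea), and a region-consistency score that rewards each sample by its vote density (the GUI-RCPO idea). This is our own illustration under stated assumptions, not the authors' released code: the box input format, the pixel-resolution grid, the peak-cell consensus rule, and all function names (`build_vote_grid`, `region_consistency_reward`, ...) are hypothetical.

```python
import numpy as np

def build_vote_grid(boxes, width, height):
    """Accumulate a spatial voting grid from sampled box predictions.

    Each predicted box (x1, y1, x2, y2) casts one vote onto every pixel
    it covers; cells covered by many samples mark regions of agreement.
    """
    grid = np.zeros((height, width), dtype=np.int32)
    for x1, y1, x2, y2 in boxes:
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(width, int(x2)), min(height, int(y2))
        if x2 > x1 and y2 > y1:
            grid[y1:y2, x1:x2] += 1
    return grid

def consensus_region(grid):
    """Bounding box (x1, y1, x2, y2) of the cells holding the maximum
    vote count -- one simple way to define the consensus region."""
    ys, xs = np.where(grid == grid.max())
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

def gui_rc_point(boxes, width, height):
    """Test-time scaling in the GUI-RC spirit: vote across samples, find
    the consensus region, and output its center as the click point."""
    x1, y1, x2, y2 = consensus_region(build_vote_grid(boxes, width, height))
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def region_consistency_reward(box, grid):
    """A label-free reward in the GUI-RCPO spirit: mean vote density inside
    the predicted box, normalized by the peak count, so samples that align
    with the collective consensus score close to 1."""
    x1, y1, x2, y2 = (int(v) for v in box)
    patch = grid[max(0, y1):y2, max(0, x1):x2]
    if patch.size == 0 or grid.max() == 0:
        return 0.0
    return float(patch.mean()) / float(grid.max())

if __name__ == "__main__":
    # Hypothetical samples for one instruction on a 1920x1080 screen:
    # two near-duplicate predictions (the likely target) and one outlier.
    samples = [(100, 200, 180, 240), (105, 198, 182, 238), (300, 400, 360, 440)]
    grid = build_vote_grid(samples, 1920, 1080)
    print("GUI-RC click point:", gui_rc_point(samples, 1920, 1080))
    print("rewards:", [round(region_consistency_reward(b, grid), 3) for b in samples])
```

In a GUI-RCPO-style loop, such consistency scores would stand in for labeled rewards in an ordinary policy-gradient update over groups of samples, which is how the abstract describes models refining themselves on unlabeled data at inference time.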
Community
We are happy to introduce GUI-RC and GUI-RCPO, two methods that improve GUI grounding accuracy by leveraging spatial consistency in model predictions, with no additional training data. Together, test-time consensus voting and self-supervised reinforcement learning boost accuracy by 2-5%.
Project Page: https://zju-real.github.io/gui-rcpo/
GitHub: https://github.com/zju-real/gui-rcpo
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding (2025)
- InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization (2025)
- UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding (2025)
- GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning (2025)
- R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding (2025)
- NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks (2025)
- GTA1: GUI Test-time Scaling Agent (2025)