Abstract
VeriGUI is a novel dataset for evaluating GUI agents in long-horizon tasks, emphasizing long-chain complexity and subtask-level verifiability.
Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution. In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions: (1) long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point; and (2) subtask-level verifiability, which enables diverse exploration strategies within each subtask, while ensuring that each subtask-level goal remains verifiable and consistent. The dataset consists of GUI task trajectories across both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.
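To make the two dimensions concrete, here is a minimal, hypothetical sketch of the structure the abstract describes: a long-horizon task decomposed into ordered, interdependent subtasks, each carrying its own verifiable goal so evaluation can happen at the subtask level rather than outcome-only. All class, field, and function names below are illustrative assumptions, not VeriGUI's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    instruction: str         # natural-language goal for this subtask
    success_criterion: str   # the verifiable condition checked when the subtask ends

@dataclass
class LongChainTask:
    task_id: str
    subtasks: list[Subtask]  # ordered and interdependent; any subtask can serve as a valid starting point

def subtask_score(completed: list[bool]) -> float:
    """Fraction of subtask goals met -- a subtask-level metric
    rather than a single pass/fail on the final outcome."""
    return sum(completed) / len(completed) if completed else 0.0
```

Under this framing, an agent that finishes 3 of 5 subtasks still earns partial credit, which is exactly the signal outcome-only verification discards.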
Community
More of our research will be published in the near future. We encourage everyone to stay tuned for our upcoming work and genuinely hope our contributions will benefit the broader community.
https://huggingface.co/datasets/2077AIDataFoundation/VeriGUI
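For convenience, a minimal sketch of pulling the dataset with the `datasets` library. The repo id is taken from the link above; any split or column names should be checked against the dataset card before being relied on.

```python
# Minimal sketch: load VeriGUI from the Hugging Face Hub.
# Repo id comes from the link above; splits and features are
# assumptions to verify against the dataset card.
from datasets import load_dataset

ds = load_dataset("2077AIDataFoundation/VeriGUI")
print(ds)  # inspect the available splits and features first
```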
VeriGUI is the first verifiable long-chain GUI dataset for general-purpose agents, and it should push the boundaries of what such agents can do. Thrilled to be part of 2077 AI and to explore the future of agent data together!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents (2025)
- AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents (2025)
- NatureGAIA: Pushing the Frontiers of GUI Agents with a Challenging Benchmark and High-Quality Trajectory Dataset (2025)
- MobileUse: A GUI Agent with Hierarchical Reflection for Autonomous Mobile Operation (2025)
- CoAct-1: Computer-using Agents with Coding as Actions (2025)
- GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies (2025)
- MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment (2025)