GRPO for GUI Grounding Done Right

Community Article Published June 11, 2025

Estimated reading time: 8 minutes

Reinforcement learning (RL), e.g., GRPO, helps with grounding because of its inherent objective alignment: it rewards any successful click rather than encouraging long textual Chain-of-Thought (CoT) reasoning (also referred to as "thinking"). In this blog, we share a complete recipe for training state-of-the-art GUI grounding models using GRPO.
Authors:
Yan Yang, Dongxu Li, Yuhao Yang, Ziyang Luo, Yutong Dai, Zeyuan Chen, Ran Xu, Liyuan Pan, Caiming Xiong, Junnan Li
Affiliations:
Salesforce Research, The Australian National University, University of Hong Kong, Beijing Institute of Technology

🤔 What is GUI Grounding?

When a graphical user interface (GUI) agent performs a task on behalf of the user, one critical step is grounding, which determines the coordinate to "click" on the UI based on the user instruction. Formally, the task is to predict an (x, y) coordinate on a GUI screenshot in response to a textual instruction. The goal is to identify and click the correct target element, such as a button, link, or icon, based on the user's intent. Below, we provide a specific example.

A GUI grounding example. The user instruction is "Click menu", and the model is expected to predict the center (x, y) coordinate of the three-dot icon (highlighted with a red circle) in the top-right corner of the screenshot image.

🧪 Why GRPO?

In GUI grounding, any click within the target element is considered a correct prediction, meaning the output coordinate (x, y) only needs to fall inside the correct element region. Unlike Supervised Fine-Tuning (SFT), which rigidly trains the model to predict the exact center of the target element, Group Relative Policy Optimization (GRPO) adopts a more flexible approach: it optimizes for successful actions, accepting any valid click within the target area, which better matches how real users interact.

An overview comparing SFT and GRPO. Left: SFT enforces prediction of only the center of the target element (indicated by the red star). Right: GRPO samples multiple predictions (indicated by the red stars) and rewards any that fall within the target element region (the blue bounding box), rather than requiring an exact center. Incorrect predictions outside the blue bounding box are penalized equally.
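To make this contrast concrete, here is a minimal sketch (not our training code) of the two signals: SFT regresses toward the element center, while the GRPO reward only checks whether a sampled click lands inside the ground-truth bounding box. The bounding box values are hypothetical.

def sft_target(bbox):
    """SFT supervises only the exact center of the target element."""
    x_min, y_min, x_max, y_max = bbox
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def grpo_click_reward(click, bbox):
    """GRPO rewards any click that falls inside the target element."""
    x, y = click
    x_min, y_min, x_max, y_max = bbox
    return 1.0 if (x_min <= x <= x_max and y_min <= y <= y_max) else 0.0

bbox = (12, 10, 42, 40)                   # hypothetical target box (x_min, y_min, x_max, y_max)
print(sft_target(bbox))                   # (27.0, 25.0): the only target SFT accepts
print(grpo_click_reward((35, 18), bbox))  # 1.0: any in-box click is rewarded
print(grpo_click_reward((80, 18), bbox))  # 0.0: clicks outside the box receive no reward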

Conventionally, when training a model with GRPO, the model is prompted to reason about the instruction and the image before producing a final answer, with the following expected output format:

<think> the textual reasoning process </think>
<answer> the answer to the user instruction </answer>

We refer to the reasoning process enclosed within the <think> tags as the textual Chain-of-Thought (CoT), or "thinking".
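For illustration, a simple parser for this output format might look like the sketch below. The exact answer syntax (here a plain "(x, y)" pair) is an assumption, so adapt the regular expression to your model's output convention.

import re

def parse_response(text):
    """Extract the (x, y) answer from a <think>...</think><answer>...</answer> response.

    Returns None if the response does not follow the expected format.
    """
    match = re.search(r"<answer>\s*\(?\s*(\d+)\s*,\s*(\d+)\s*\)?\s*</answer>", text, re.DOTALL)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

response = "<think> The menu is the three-dot icon in the top-right corner. </think>\n<answer> (27, 25) </answer>"
print(parse_response(response))  # (27, 25)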


📦 GUI Grounding Dataset

To train the model with GRPO effectively, we need a dataset containing:

  • Instruction;
  • GUI image;
  • Target element bounding box (i.e., valid click region).

For example,

 {
  "instruction": "Click menu",
  "image_path": "images/screenshot_001.png",
  "target_bbox": {
    "x_min": 12,
    "y_min": 10,
    "x_max": 42,
    "y_max": 40
  }
}
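A record like this can be loaded and sanity-checked with a few lines of Python. The sketch below assumes the field names from the example above and a hypothetical file path; adjust both to your own data layout.

import json
from PIL import Image

def load_sample(path):
    """Load one grounding sample and verify the bounding box lies inside the image."""
    with open(path) as f:
        sample = json.load(f)
    bbox = sample["target_bbox"]
    width, height = Image.open(sample["image_path"]).size
    assert 0 <= bbox["x_min"] < bbox["x_max"] <= width
    assert 0 <= bbox["y_min"] < bbox["y_max"] <= height
    return sample

sample = load_sample("data/screenshot_001.json")  # hypothetical path
print(sample["instruction"])  # "Click menu"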

There are usually three main types of training data:

  • Mobile (e.g., Android or iOS apps);
  • Desktop (e.g., Windows, Linux applications);
  • Web (e.g., browser-based interfaces).

For desktop and web datasets, data is generally collected via screenshots alongside accessibility (A11y) tools or HTML parsers that extract element structure and bounding boxes. However, these bounding boxes may sometimes be misaligned with the visual rendering due to UI animations or timing inconsistencies. In our work, we primarily rely on datasets curated from Aria-UI and OS-Atlas, which we found to be cleaner and better aligned than alternative data collections.

To further improve data quality, we apply a lightweight cleaning strategy:

  • Detect all elements on the screenshot using OmniParser;
  • Compute the maximum Intersection over Union (IoU) between each annotated bounding box and the detected elements;
  • Filter out samples whose maximum IoU falls below a predefined threshold.

This helps ensure that training data remains consistent with actual visual targets, reducing noise from misaligned annotations. While this filter may occasionally remove correctly annotated samples, we find such cases account for less than 3% of the data. Refer to our code for details.
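A minimal sketch of this filter is shown below. Here detect_elements is a stand-in for an OmniParser detection call (its actual API differs), and the 0.5 threshold is illustrative rather than the exact value used in our pipeline.

def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def keep_sample(annotated_box, screenshot, threshold=0.5):
    """Keep a sample only if its annotation overlaps a detected element well enough."""
    detected_boxes = detect_elements(screenshot)  # placeholder for an OmniParser detection call
    if not detected_boxes:
        return False
    return max(iou(annotated_box, box) for box in detected_boxes) >= threshold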

Examples from the Aria-UI dataset collection. The blue bounding box shows the derived annotation, while the red bounding boxes are detected using OmniParser. A large green arrow is used to draw attention to the misaligned blue bounding box. Our lightweight cleaning strategy filters out such cases where the annotation does not match the actual UI element.

🛠️ Model Training

We use various open-source models as baselines (e.g., UI-TARS and Qwen2.5-VL), scaling from 7B to 32B and 72B parameters, and train them with the VLM-R1 codebase. Training completes in approximately 800 H100 GPU-hours over 250 optimization iterations. Here, we share key insights and lessons learned during training.

  • "Thinking" is not required to achieve strong grounding performance with GRPO. The effectiveness of GRPO primarily comes from its objective alignment—rewarding successful clicks regardless of how they are expressed. In fact, avoiding both "thinking" and KL regularization often leads to more flexible and accurate coordinate predictions. We’ll discuss the trade-offs of using "thinking" in more detail later—it tends to help only in specific scenarios.
  • Click-based rewards are sufficient. We experimented with various reward functions (e.g., MSE-based, IoU-based, format rewards for "thinking", and so on). A simple reward that checks whether the predicted point falls inside the target region is enough to achieve strong performance (a minimal sketch follows this list).
  • For both “thinking” and “non-thinking” GRPO, performing SFT as a cold start is unnecessary. Qwen2.5-VL and UI-TARS are already sufficiently strong, and SFT prior to GRPO does not yield significant improvements in grounding performance.
  • Use a batch size larger than 128. Smaller batches (e.g., 16 or 32) can lead to training instability. For example, if a batch contains only entirely correct or entirely incorrect samples, the reward signal may vanish, causing model collapse.
  • Sampling 8 responses per instruction is generally sufficient to achieve strong performance. Increasing this number yields diminishing returns. (Important: During sampling, make sure to add “bad words” (i.e., banned tokens) to prevent the model from generating <img> tokens. On diverse datasets and longer training runs, forgetting this can lead to alignment issues or spurious behavior related to image token generation.)
  • KL divergence with a reference model is not necessary. Qwen2.5-VL already performs strongly in the mobile domain, so while adding a KL penalty may help retain performance in the mobile setting, it tends to limit exploration in the desktop and web domains.
  • The model is not sensitive to the learning rate. A peak learning rate of 1e-6 works well in most settings.
A plot showing reward (y-axis, 0-1 range) variation over optimization iterations (x-axis) visualized from TensorBoard.
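Putting several of these points together (a binary click reward, a group of 8 sampled responses per instruction, group-relative advantages, and no KL term), the core advantage computation reduces to a few lines. This is a hedged sketch of the idea rather than the VLM-R1 implementation:

import torch

def click_reward(click, bbox):
    """1.0 if the predicted click lands inside the target bounding box, else 0.0."""
    x, y = click
    x_min, y_min, x_max, y_max = bbox
    return float(x_min <= x <= x_max and y_min <= y <= y_max)

def group_relative_advantages(clicks, bbox):
    """Standardize rewards within the group of responses sampled for one instruction.

    If every sample in the group succeeds (or every one fails), the advantages
    collapse toward zero, which is why small batches can destabilize training.
    """
    rewards = torch.tensor([click_reward(c, bbox) for c in clicks])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

bbox = (12, 10, 42, 40)  # hypothetical target element
sampled_clicks = [(27, 25), (35, 18), (80, 90), (13, 11), (50, 50), (20, 30), (41, 39), (5, 5)]
print(group_relative_advantages(sampled_clicks, bbox))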

📈 How Does the Model Perform?

We follow the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently achieves the best results among all open-source model families. Below are the comparative results:

| Model | Size | Open Source | ScreenSpot-V2 | ScreenSpot-Pro | OSWORLD-G |
|---|---|---|---|---|---|
| OpenAI CUA | — | ✗ | 87.9 | 23.4 | — |
| Claude 3.7 | — | ✗ | 87.6 | 27.7 | — |
| JEDI-7B | 7B | ✓ | 91.7 | 39.5 | 54.1 |
| SE-GUI | 7B | ✓ | 90.3 | 47.0 | — |
| UI-TARS | 7B | ✓ | 91.6 | 35.7 | 47.5 |
| UI-TARS-1.5* | 7B | ✓ | 89.7* | 42.0* | 64.2* |
| UGround-v1-7B | 7B | ✓ | — | 31.1 | 36.4 |
| Qwen2.5-VL-32B-Instruct | 32B | ✓ | 91.9* | 48.0 | 59.6* |
| UGround-v1-72B | 72B | ✓ | — | 34.5 | — |
| Qwen2.5-VL-72B-Instruct | 72B | ✓ | 94.0* | 53.3 | 62.2* |
| UI-TARS | 72B | ✓ | 90.3 | 38.1 | — |
| Grounding-R1 (Ours) | 7B | ✓ | 92.4 (∆ +2.7) | 50.1 (∆ +8.1) | 67.7 (∆ +3.5) |
| Grounding-R1 (Ours) | 32B | ✓ | 93.2 (∆ +1.3) | 53.6 (∆ +5.6) | 61.9 (∆ +2.3) |
| Grounding-R1 (Ours) | 72B | ✓ | 94.8 (∆ +0.8) | 58.4 (∆ +5.1) | 66.7 (∆ +4.5) |

Note:

  • Model size is indicated in billions (B) of parameters.
  • A dash (—) denotes results that are currently unavailable.
  • An asterisk (*) denotes results from our own evaluation.
  • UI-TARS-1.5 7B, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct are used as our baseline models.
  • ∆ indicates the improvement of our model over its corresponding baseline.

🤔 When "thinking" Help?

Across various static benchmarks, we observe minimal performance differences, usually within 0.5%, between models trained with and without “thinking”. However, the two variants often succeed on different samples, likely due to training instability rather than systematic reasoning gains. Below, we present several examples where either the “thinking” or the non-“thinking” model is correct, but not both.

Examples where "thinking" fails: the blue star ("thinking" prediction) is incorrect, while the red star (non-"thinking" prediction) falls within the green bounding box, correctly hitting the target region.
Examples where "thinking" helps: the blue star ("thinking" prediction) is correctly inside the green bounding box, while the red star (non-"thinking" prediction) is incorrect and outside the valid target region.

However, we find that "thinking" can be effective in dynamic environments such as AndroidWorld, where the model is provided with the task object, past trajectories, and the user instruction. For example, we trained an in-domain 7B model using the AndroidControl dataset. While grounding performance was similar on the AndroidControl test fold, the task success rate on AndroidWorld increased from 39% to 44% when using "thinking". This improvement is attributed to the increased complexity of the textual prompts (i.e., combination of task object, past trajectories, and the user instruction), which encourages the model to engage in "thinking" when operating under challenging and dynamic conditions.


💬 How Does SFT Compare with GRPO?

To answer this, we trained a 7B model using both SFT and GRPO on the same dataset. The SFT model achieved 90.2 on ScreenSpot-V2 and 42.5 on ScreenSpot-Pro. In contrast, the GRPO-trained model reached 92.4 and 50.1, respectively. This shows that GRPO offers a substantial performance boost. However, it is important to note that GRPO tends to bring significant improvements only when the base model already exhibits reasonably good performance; if the baseline is too weak, GRPO may struggle due to an insufficient reward signal.

Community


Thank you for this insightful blog post! We really appreciate your thorough analysis and the comprehensive comparison you've provided.
We at H Company are training Vision Language Action Models, specializing in GUI grounding, and we recently published a technical report (https://huggingface.co/papers/2506.02865) and a blog post (https://huggingface.co/blog/Hcompany/holo1) that include some relevant findings that might complement your work. Our model weights and benchmark are also available on Hugging Face for reproducibility. Would you consider incorporating the numbers from our report into your analysis and updating the corresponding tables? We believe this could provide readers with an even more complete picture of the current landscape.
Additionally, if possible, it would be wonderful if you could include results on our WebClick benchmark (https://huggingface.co/datasets/Hcompany/WebClick), as we think this would add valuable context to your comparison.
We'd be happy to provide any additional details or clarification about our methodology if that would be helpful. Thank you again for your work, and we look forward to seeing how the field continues to evolve!
