How to Train Your LLM Web Agent: A Statistical Diagnosis

Community Article · Published July 8, 2025

Link to full paper

TL;DR: We ran the first large-scale study of compute–performance tradeoffs for open-source LLM web agents. We show that combining supervised fine-tuning (SFT) with reinforcement learning (RL) is the only strategy that closes the gap with closed-source agents like GPT-4o. Our work offers a compute-efficient, statistically grounded blueprint for training open web agents that can actually reason through multi-step tasks.

LLM agents are great at solving single-step tasks like math and code. But real-world workflows, such as booking flights, filling forms, or querying dashboards, demand multiple steps and long-horizon reasoning, and they play out in brittle environments.

That’s where most agents break.

To help close this gap, we evaluate agents on two settings:

  • MiniWoB++: web UI tasks with sparse rewards
  • WorkArena++: enterprise-grade multi-page tasks from real knowledge work
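To make the multi-step, sparse-reward setting concrete, here is a minimal sketch of the evaluation loop we have in mind. The `make_env` and `agent.act` interfaces are hypothetical placeholders standing in for the benchmark harness and the LLM policy; they are not the actual benchmark APIs.

```python
# Minimal sketch of a multi-step web-agent episode with a sparse reward.
# `make_env` and `agent` are hypothetical placeholders, not the real benchmark APIs.

def run_episode(agent, make_env, task_name, max_steps=30):
    env = make_env(task_name)                  # e.g. a MiniWoB++-style task
    obs = env.reset()                          # page state (DOM / accessibility tree)
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)                # LLM picks a click / type / navigate action
        obs, reward, done = env.step(action)   # reward stays 0 until the task succeeds or fails
        total_reward += reward
        if done:
            break
    return total_reward > 0                    # sparse signal: success is only known at the end
```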

MiniWoB++

Liu et al. (2018). Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration.

WorkArena

Drouin et al. (2024). WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?


Training LLM agents isn't just challenging; it's expensive. There are two main ways to train LLM-based agents: supervised fine-tuning (SFT) on expert trajectories, or on-policy reinforcement learning (RL). We show that both methods work to some degree. But a question that has been heavily overlooked is: how should we allocate compute between SFT and RL to get the best of both worlds?
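For orientation, the two training signals look roughly like this. The sketch below assumes generic tensor shapes and is purely illustrative; it is not the paper's training code.

```python
import torch
import torch.nn.functional as F

# Sketch of the two training signals, under assumed tensor shapes
# (batch, sequence, vocab). Illustrative only, not the paper's code.

def sft_loss(logits: torch.Tensor, expert_tokens: torch.Tensor) -> torch.Tensor:
    # Supervised fine-tuning: cross-entropy against the expert's action tokens.
    return F.cross_entropy(logits.flatten(0, 1), expert_tokens.flatten())

def policy_gradient_loss(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # On-policy RL: REINFORCE-style objective, weighting sampled actions by their advantage.
    return -(logprobs * advantages).mean()
```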

Finding the right compute allocation wasn't easy: hyperparameters behave differently depending on how much SFT warmup is used, which makes tuning across setups expensive. To address this, we ran a random search over 1,370 SFT+RL configurations and used a bootstrapping technique to identify robust hyperparameter choices.
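As a rough illustration of this kind of bootstrap analysis (not the exact procedure from the paper), one can resample completed runs with replacement and, for each resample, record which value of a hyperparameter gives the best mean performance; the win frequency across resamples indicates how robust that choice is. The column names, synthetic data, and "best mean score wins" rule below are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of a bootstrap analysis over random-search runs.
# Column names and synthetic scores are stand-ins, not real results.
rng = np.random.default_rng(0)
runs = pd.DataFrame({
    "temperature": rng.choice([0.1, 0.25, 0.5, 1.0], size=1370),
    "score": rng.random(1370),   # stand-in for task success rate
})

def bootstrap_best_value(df, hparam, metric="score", n_boot=1000, seed=0):
    """Estimate how often each value of `hparam` wins across bootstrap resamples."""
    boot_rng = np.random.default_rng(seed)
    wins = {}
    for _ in range(n_boot):
        sample = df.sample(len(df), replace=True,
                           random_state=int(boot_rng.integers(0, 2**31 - 1)))
        best = sample.groupby(hparam)[metric].mean().idxmax()
        wins[best] = wins.get(best, 0) + 1
    # Fraction of resamples in which each value came out on top.
    return {value: count / n_boot for value, count in sorted(wins.items())}

print(bootstrap_best_value(runs, "temperature"))
```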

With these hyperparameters in hand, we show that the hybrid approach is consistently best—outperforming both raw SFT and pure on-policy RL across both benchmarks. Crucially, we find that branching into RL early—but not immediately—unlocks the best performance–compute tradeoff. On MiniWoB++, this strategy matches the peak performance of pure SFT using just 55% of the compute—and even surpasses it in some settings.
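Schematically, the hybrid recipe amounts to a short SFT warmup followed by branching into on-policy RL from that checkpoint. The outline below uses caller-supplied callables (`sft_step`, `rollout_fn`, `rl_step`) as hypothetical interfaces; it is a simplified sketch, not the training code from the paper.

```python
# Simplified outline of the SFT-then-RL hybrid schedule.
# `sft_step`, `rollout_fn`, and `rl_step` are hypothetical caller-supplied callables.

def train_hybrid(policy, expert_batches, rollout_fn, sft_step, rl_step,
                 warmup_steps, rl_steps):
    # Phase 1: brief SFT warmup on expert trajectories (behavior cloning).
    for _ in range(warmup_steps):
        sft_step(policy, next(expert_batches))   # cross-entropy on expert actions

    # Phase 2: branch into on-policy RL from the warm-started checkpoint.
    for _ in range(rl_steps):
        rollouts = rollout_fn(policy)            # sample episodes with the current policy
        rl_step(policy, rollouts)                # policy-gradient update (e.g. GRPO-style)

    return policy
```

The key knob is `warmup_steps`: branching early, but not immediately, is what gives the best performance–compute tradeoff in our experiments.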

This strategy also achieves the best final performance on both WorkArena and MiniWoB++.


Our hyperparameter analysis revealed several consistent patterns. Decoding temperature had the largest impact overall, with 0.25 emerging as the sweet spot across settings. GRPO's group-relative advantage proved helpful, but only after some SFT warmup; using it too early actually hurt performance. Similarly, curriculum learning boosted performance when RL was cold-started, but became counterproductive once the model had been warm-started. And while trust region clipping stabilized training under heavy SFT warmup, it offered little benefit, and sometimes slowed learning, when used without that warmup.
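For reference, GRPO's group-relative advantage standardizes each rollout's reward against the other rollouts sampled for the same task prompt. A minimal sketch of that per-group computation:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each rollout's reward within its group.

    `rewards` holds the returns of a group of rollouts sampled for the same
    task prompt; each advantage is the reward normalized against the group
    mean and standard deviation.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four rollouts of the same task, only one of which succeeded.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```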


In sum, we provide an effective recipe for training LLM-based web agents that outperforms SFT on expert trajectories while using significantly less compute. In addition, the bootstrap analysis of our random search yields practical takeaways on what works, and what doesn't, when training web agents. Together, these findings offer a reproducible, budget-aware blueprint for advancing open-source LLM web agents in complex multi-step environments.
