arxiv:2506.01952

WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

Published on Jun 2
Β· Submitted by AtsuMiyai on Jun 3
AI-generated summary

WebChoreArena, a new benchmark comprising 532 tasks, extends the scope of WebArena to more complex and tedious web browsing tasks, measuring advancements in LLM capabilities.

Abstract

Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks requiring accurate retrieval of large amounts of information from the observations, (ii) Calculation tasks demanding precise mathematical reasoning, and (iii) Long-Term Memory tasks necessitating memory carried across multiple webpages. Built on top of the four fully reproducible and widely adopted WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results demonstrate that as LLMs evolve, from GPT-4o to Claude 3.7 Sonnet and Gemini 2.5 Pro, performance on WebChoreArena improves significantly. These findings suggest that WebChoreArena is well-suited to measure the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even with Gemini 2.5 Pro, there remains substantial room for improvement compared to WebArena, highlighting the increased challenges posed by WebChoreArena.
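The three challenge categories can be pictured as a simple task schema with an answer check against a curated gold label. The sketch below is purely illustrative; the field names, `Challenge` enum, and `evaluate` helper are hypothetical and not the benchmark's actual API:

```python
from dataclasses import dataclass
from enum import Enum

class Challenge(Enum):
    # The three task categories described in the abstract
    MASSIVE_MEMORY = "massive_memory"      # retrieve large amounts of observed information
    CALCULATION = "calculation"            # precise mathematical reasoning
    LONG_TERM_MEMORY = "long_term_memory"  # carry state across multiple webpages

@dataclass
class Task:
    task_id: int
    site: str              # e.g., one of the four WebArena environments
    challenge: Challenge
    instruction: str
    expected_answer: str   # curated gold answer

def evaluate(task: Task, agent_answer: str) -> bool:
    """Toy exact-match check, ignoring surrounding whitespace and case."""
    return agent_answer.strip().lower() == task.expected_answer.strip().lower()

task = Task(1, "shopping", Challenge.CALCULATION,
            "Sum the prices of all items in the January orders.",
            "1234.56")
print(evaluate(task, " 1234.56 "))  # True
```

Real web-agent benchmarks typically use richer evaluators (e.g., program- or state-based checks); exact match is shown here only to make the task/answer structure concrete.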

Community

Paper author Paper submitter

πŸ§™β€β™‚οΈ Imagine web agents that don’t just browse but handle your tedious digital chores!

πŸ“£ Our team developed WebChoreArena

  • 532 human-curated tasks, crafted over 300+ hours
  • Tests agents on massive information memorization, mathematical reasoning, and long-term memory
  • Built on WebArena with full reproducibility

πŸ“Š Even Gemini 2.5 Pro shows substantial room for improvement, highlighting a critical challenge for the next LLMs-based web agents!

🌐 https://webchorearena.github.io
πŸ“• https://arxiv.org/abs/2506.01952

Β·

Nice extension to WebArena!
