WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Abstract
WebWatcher, a multimodal deep-research agent with enhanced vision-language reasoning, is trained on synthetic trajectories and reinforcement learning and outperforms existing agents on complex visual and textual information-retrieval tasks.
Web agents such as Deep Research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains primarily text-centric, overlooking visual information in the real world. This makes multimodal Deep Research highly challenging, as such agents require much stronger reasoning abilities in perception, logic, knowledge, and the use of more sophisticated tools compared to text-based agents. To address this limitation, we introduce WebWatcher, a multimodal agent for Deep Research equipped with enhanced vision-language reasoning capabilities. It leverages high-quality synthetic multimodal trajectories for efficient cold-start training, utilizes various tools for deep reasoning, and further enhances generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we propose BrowseComp-VL, a BrowseComp-style benchmark that requires complex information retrieval involving both visual and textual information. Experimental results show that WebWatcher significantly outperforms proprietary baselines, RAG workflows, and open-source agents on four challenging VQA benchmarks, paving the way for solving complex multimodal information-seeking tasks.
Community
In this paper, we introduce WebWatcher, a multimodal agent for deep research that possesses enhanced visual-language reasoning capabilities. Our work presents a unified framework that combines complex vision-language reasoning with multi-tool interaction.
Key features of our approach include:
- BrowseComp-VL Benchmark: We propose a new benchmark, BrowseComp-VL, to evaluate the capabilities of multimodal agents. This challenging dataset is designed for in-depth multimodal reasoning and strategic planning, mirroring the complexity of BrowseComp but extending it into the visual domain. It emphasizes tasks that require both visual perception and advanced information-gathering abilities.
- Automated Trajectory Generation: To provide robust tool-use capabilities, we developed an automated pipeline to generate high-quality, multi-step reasoning trajectories. These trajectories, which are grounded in actual tool-use behavior and reflect procedural decision-making, are used for efficient cold-start training and further optimization via reinforcement learning. The agent is equipped with several tools, including Web Image Search, Web Text Search, Webpage Visit, Code Interpreter, and an internal OCR tool (a minimal, hypothetical sketch of such a tool loop appears after this list).
- Superior Performance: WebWatcher significantly outperforms proprietary baselines, RAG workflows, and other open-source agents across four challenging VQA benchmarks: Humanity's Last Exam (HLE)-VL, BrowseComp-VL, LiveVQA, and MMSearch. The WebWatcher-32B model, in particular, achieves an average score of 18.2% on HLE, surpassing the GPT-4o-based OmniSearch baseline. It also achieves top-tier performance on LiveVQA (58.7%) and MMSearch (55.3%), demonstrating stable and superior results on demanding, real-world visual search benchmarks.
- We will release our code and the BrowseComp-VL dataset at https://github.com/Alibaba-NLP/WebAgent. Stay tuned!
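For readers who want a concrete picture of how a tool-augmented agent loop like the one described above can be wired together, here is a minimal sketch in Python. The tool names follow the list in the bullet on trajectory generation, but the JSON action format, function names, and stub bodies are illustrative assumptions of ours, not WebWatcher's released implementation; trajectories logged from such a loop are the kind of data a cold-start stage would train on.

```python
# Hypothetical sketch of a ReAct-style tool loop over the five tools listed above.
# The action format, function names, and stub tool bodies are illustrative assumptions,
# not the released WebWatcher code.
import json
from typing import Callable, Dict


def web_image_search(query: str) -> str:
    return f"[stub] image results for: {query}"        # would call an image search API


def web_text_search(query: str) -> str:
    return f"[stub] text results for: {query}"         # would call a web search API


def webpage_visit(url: str) -> str:
    return f"[stub] page content of: {url}"            # would fetch and clean a page


def code_interpreter(code: str) -> str:
    return f"[stub] execution output of: {code!r}"     # would run code in a sandbox


def ocr(image_ref: str) -> str:
    return f"[stub] text extracted from: {image_ref}"  # internal OCR over the query image


TOOLS: Dict[str, Callable[[str], str]] = {
    "web_image_search": web_image_search,
    "web_text_search": web_text_search,
    "webpage_visit": webpage_visit,
    "code_interpreter": code_interpreter,
    "ocr": ocr,
}


def fake_model(messages) -> str:
    """Stand-in for the policy model; emits one tool call, then a final answer."""
    if len(messages) <= 2:
        return json.dumps({"thought": "Read the text in the image first.",
                           "action": "ocr", "action_input": "query_image"})
    return json.dumps({"thought": "Enough evidence gathered.",
                       "final_answer": "Example answer"})


def run_agent(question: str, image_ref: str, model=fake_model, max_steps: int = 8) -> str:
    """Alternate model steps and tool observations until a final answer or step budget."""
    messages = [{"role": "system", "content": "You may call: " + ", ".join(TOOLS)},
                {"role": "user", "content": f"{question} (image: {image_ref})"}]
    for _ in range(max_steps):
        step = json.loads(model(messages))
        if "final_answer" in step:
            return step["final_answer"]
        observation = TOOLS[step["action"]](step["action_input"])
        messages.append({"role": "assistant", "content": json.dumps(step)})
        messages.append({"role": "tool", "content": observation})
    return "No answer within step budget"


if __name__ == "__main__":
    print(run_agent("Which year was the building in this photo completed?", "query_image"))
```

The stubs make the sketch runnable end to end; in a real system each stub would be replaced by the corresponding search, browsing, sandbox, or OCR backend, and the `fake_model` placeholder by the trained policy.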
arXiv Explained breakdown of this paper: https://arxivexplained.com/papers/webwatcher-breaking-new-frontier-of-vision-language-deep-research-agent
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- WebSailor: Navigating Super-human Reasoning for Web Agent (2025)
- MMSearch-R1: Incentivizing LMMs to Search (2025)
- From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents (2025)
- MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning (2025)
- A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges (2025)
- Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training (2025)
- BrowseMaster: Towards Scalable Web Browsing via Tool-Augmented Programmatic Agent Pair (2025)