Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents
Abstract
A framework for web agents decomposes their capabilities into knowledge content learning and cognitive processes, using a structured dataset and a novel reasoning framework to enhance generalization and performance.
Large multimodal models have significantly advanced the development of web agents, enabling perception of and interaction with digital environments akin to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to effectively engage in cognitive reasoning. Therefore, we decompose a web agent's capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, categorizing knowledge as Factual, Conceptual, and Procedural. In this framework, knowledge content learning corresponds to the agent's processes of Memorizing and Understanding, which rely on the first two knowledge types, representing the "what" of learning. Conversely, cognitive processes correspond to Exploring, grounded in Procedural knowledge, defining the "how" of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites, designed to systematically instill the core knowledge necessary for web agents. This dataset serves as the agent's conceptual grounding, the "nouns" upon which comprehension is built, as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, developing and training our proposed agent, the Web-CogReasoner. Extensive experimentation reveals its significant superiority over existing models, especially in generalizing to unseen tasks where structured knowledge is decisive. To enable rigorous evaluation, we introduce Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities. Our code and data are open-sourced at https://github.com/Gnonymous/Web-CogReasoner
Community
Web-CogReasoner introduces a paradigm shift from simply enhancing web agents to systematically building their cognitive abilities from the ground up. Inspired by Bloom's Taxonomy, it decomposes agent capabilities into knowledge content learning (Factual, Conceptual) and cognitive processes (Procedural), enabling interpretable and goal-directed behavior. Built upon large multimodal models, it performs knowledge-driven Chain-of-Thought (CoT) reasoning across complex web tasks, where each reasoning step is transparently grounded in a specific knowledge type, ensuring both interpretability and robust generalization.
To support this, we introduce:
Web-CogKnowledge Framework: A Bloom's Taxonomy-inspired two-stage training paradigm (Knowledge Content Learning → Cognitive Reasoning) for enhancing web agents' cognitive abilities.
Web-CogReasoner: A knowledge-driven multimodal agent trained via imitation learning on our Web-CogDataset.
Web-CogDataset: A curriculum-style dataset with 12 fine-grained tasks across 3 knowledge levels (Factual, Conceptual, Procedural), enabling stepwise skill acquisition.
Web-CogBench: A dedicated benchmark for evaluating whether a web agent possesses the requisite prior knowledge and cognitive capabilities for effective web navigation.
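To make the knowledge-grounded CoT idea above concrete, here is a minimal, hypothetical Python sketch of what a reasoning trace with typed steps might look like. The class names, fields, and the example action string are illustrative assumptions, not the paper's actual implementation: the point is only that each step carries a knowledge-type tag (Factual, Conceptual, or Procedural) and that only Procedural steps emit executable actions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

# Hypothetical encoding of the three knowledge levels described above.
class KnowledgeType(Enum):
    FACTUAL = "factual"        # what is on the page (elements, visible text)
    CONCEPTUAL = "conceptual"  # what elements mean and how they relate
    PROCEDURAL = "procedural"  # how to act to make progress on the task

@dataclass
class CoTStep:
    knowledge: KnowledgeType
    rationale: str
    action: Optional[str] = None  # only procedural steps carry an action

def trace_actions(steps: List[CoTStep]) -> List[str]:
    """Collect the executable actions from a knowledge-grounded CoT trace."""
    return [
        s.action
        for s in steps
        if s.knowledge is KnowledgeType.PROCEDURAL and s.action is not None
    ]

# Example trace for a toy web task (all content here is illustrative).
trace = [
    CoTStep(KnowledgeType.FACTUAL, "The page shows a search box labeled 'Search'."),
    CoTStep(KnowledgeType.CONCEPTUAL, "The search box is the entry point for product queries."),
    CoTStep(KnowledgeType.PROCEDURAL, "Type the query into the search box.",
            action="type(search_box, 'laptop')"),
]
print(trace_actions(trace))  # -> ["type(search_box, 'laptop')"]
```

This structure makes every action traceable to a stated rationale and a knowledge type, which is the interpretability property the framework emphasizes.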
Related Links
📄 arXiv: https://arxiv.org/abs/2508.01858
💻 Code: https://github.com/Gnonymous/Web-CogReasoner
🤗 Models: https://huggingface.co/Gnonymous/Web-CogReasoner
🤗 Dataset: https://huggingface.co/datasets/Gnonymous/Web-CogDataset
📝 Blogs: https://Gnonymous.github.io/blogs/Web-CogReasoner
🏠 Homepage: https://Gnonymous.github.io/Web-CogReasoner
Citation
@article{guo2025web,
title={Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents},
author={Guo, Yuhan and Guo, Cong and Sun, Aiwen and He, Hongliang and Yang, Xinyu and Lu, Yue and Zhang, Yingji and Guo, Xuntao and Zhang, Dong and Liu, Jianzhuang and others},
journal={arXiv preprint arXiv:2508.01858},
year={2025}
}
The following similar papers were recommended by the Semantic Scholar API:
- Learning, Reasoning, Refinement: A Framework for Kahneman's Dual-System Intelligence in GUI Agents (2025)
- Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence (2025)
- Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models (2025)
- WebSynthesis: World-Model-Guided MCTS for Efficient WebUI-Trajectory Synthesis (2025)
- MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning (2025)
- MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents (2025)
- GTA1: GUI Test-time Scaling Agent (2025)