BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
Abstract
As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun), and a two-stage quality control protocol is applied to ensure high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: many achieve accuracy below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH: success demands not only effective retrieval strategies but also sophisticated reasoning and information reconciliation -- capabilities that current models have yet to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.
Community
Excited to share our recent work: BrowseComp-ZH, the first high-difficulty benchmark specifically designed to evaluate large language models (LLMs) on Chinese web browsing tasks.
BrowseComp-ZH serves as a critical testbed for assessing:
Reasoning-augmented LLMs
Agent-based search systems
Retrieval-augmented generation (RAG) in non-English contexts
We constructed 289 multi-constraint questions across 11 domains (e.g., Film, Art, History, Medicine), each reverse-engineered from a factual answer and validated through a rigorous two-stage quality control process.
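To make the item format concrete, here is a minimal Python sketch of what one benchmark entry and a lenient answer check could look like. The field names and example content are our illustrative assumptions, not the released schema; see the GitHub repository above for the actual data format and grading protocol.

```python
# Minimal sketch of a BrowseComp-ZH-style item and a simple answer check.
# NOTE: the field names and example content here are illustrative assumptions,
# not the released schema; real items are multi-constraint Chinese questions.

item = {
    "domain": "History",
    "question": "(a multi-constraint Chinese question, reverse-engineered from the answer)",
    "answer": "1987",  # short, objective, easily verifiable: a date, number, or proper noun
}

def normalize(text: str) -> str:
    """Lowercase and drop whitespace for a lenient exact match."""
    return "".join(ch for ch in text.lower() if not ch.isspace())

def is_correct(prediction: str, reference: str) -> bool:
    # Exact match after light normalization; an LLM judge could be used
    # instead to handle aliases and paraphrased answers.
    return normalize(prediction) == normalize(reference)

print(is_correct(" 1987 ", item["answer"]))  # True
```

Reverse-engineering each question from a short, verifiable answer is what makes this kind of automatic check feasible in the first place.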
Despite strong performance on existing benchmarks, mainstream models struggled significantly on BrowseComp-ZH:
1️⃣ GPT-4o: 6.2% accuracy
2️⃣ Most models scored below 10%
3️⃣ Even the best-performing system, OpenAI DeepResearch, achieved only 42.9%
Why is this benchmark so challenging?
• Chinese web content is highly fragmented across platforms
• Tasks demand multi-hop reasoning and cross-page synthesis (see the sketch below)
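To make these two challenges concrete, below is a rough Python sketch of the search-read-reason loop an agentic browser runs. Everything in it is a simplified assumption: `llm`, `web_search`, and `fetch_page` are hypothetical stand-ins for a model client, a search API, and a page fetcher, and no evaluated system is claimed to work exactly this way.

```python
# Rough sketch of a multi-hop browse-and-answer loop. `llm`, `web_search`,
# and `fetch_page` are hypothetical stand-ins supplied by the caller (an LLM
# client, a search API, and an HTML fetcher); this is an illustration of the
# task structure, not any evaluated system's implementation.

def answer_with_browsing(question, llm, web_search, fetch_page, max_hops=5):
    notes = []          # evidence accumulated across pages (cross-page synthesis)
    query = question    # first hop: search with the question itself
    for _ in range(max_hops):
        results = web_search(query)        # -> list of (title, url) pairs
        if not results:
            break
        page = fetch_page(results[0][1])   # naive choice: read the top hit
        notes.append(llm(f"Extract facts relevant to: {question}\n\n{page}"))
        # Multi-hop reasoning step: decide whether the evidence already pins
        # down a unique answer, or what to search for next.
        verdict = llm(
            "Reply 'ANSWER: <answer>' if the notes determine a unique answer, "
            "otherwise 'SEARCH: <next query>'.\n"
            f"Question: {question}\nNotes: {notes}"
        )
        if verdict.startswith("ANSWER:"):
            return verdict[len("ANSWER:"):].strip()
        query = verdict[len("SEARCH:"):].strip()
    return "unknown"
```

Fragmented platforms mean the first search rarely surfaces the answer directly, so a system must keep reformulating queries and reconciling evidence across hops, which is exactly where most models break down.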
This work is a collaboration among HKUST (Guangzhou), Peking University, Zhejiang University, Alibaba, ByteDance, NIO, and others. We hope it contributes to advancing multilingual, tool-using LLM agents and inspires further research in Chinese web intelligence.