DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks
Abstract
DeepResearch Arena is a benchmark built from academic seminar transcripts that provides high-quality research tasks for evaluating deep research agents across multiple disciplines.
Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows spanning literature synthesis, methodological design, and empirical verification. Despite these strides, faithfully evaluating their research capability remains challenging because it is difficult to collect frontier research questions that genuinely capture researchers' attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars, which capture rich expert discourse and interaction, better reflect real-world research environments, and reduce the risk of data leakage. To construct DeepResearch Arena automatically, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts and then translates them into high-quality research tasks, ensuring traceable task formulation while filtering out noise. With MAHTG, we curate over 10,000 high-quality research tasks from more than 200 academic seminars spanning 12 disciplines, including literature, history, and the sciences. Our extensive evaluation shows that DeepResearch Arena poses substantial challenges for current state-of-the-art agents, with clear performance gaps across models.
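The abstract describes MAHTG only at a high level: one stage mines "research-worthy inspirations" from transcripts, a second turns them into tasks while keeping each task traceable to its source excerpt. Below is a minimal sketch of what such a two-stage pipeline could look like; all names (`extract_inspirations`, `generate_tasks`, `ResearchTask`) and prompts are hypothetical illustrations, not the paper's actual implementation.

```python
# Minimal sketch of a two-stage, MAHTG-style pipeline. Hypothetical names
# and prompts; the paper's actual agent prompts and filters are not shown here.
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # any text-in/text-out model client


@dataclass
class ResearchTask:
    inspiration: str  # excerpt from the seminar transcript (for traceability)
    task: str         # research task generated from that excerpt


def extract_inspirations(transcript: str, llm: LLM) -> List[str]:
    """Stage 1: an 'inspiration agent' mines research-worthy ideas."""
    prompt = (
        "List research-worthy ideas raised in this seminar transcript, "
        "one per line:\n\n" + transcript
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def generate_tasks(inspirations: List[str], llm: LLM) -> List[ResearchTask]:
    """Stage 2: a 'task agent' turns each inspiration into a concrete task,
    keeping the source excerpt attached so the task stays traceable."""
    tasks = []
    for idea in inspirations:
        prompt = f"Formulate a concrete, answerable research task from: {idea}"
        tasks.append(ResearchTask(inspiration=idea, task=llm(prompt)))
    return tasks
```

In this framing, noise filtering would sit between the two stages (e.g., an agent that discards inspirations too vague to yield a task), and traceability comes from carrying the transcript excerpt alongside each generated task.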
Community
DeepResearch Arena is a seminar-grounded benchmark of over 10,000 research tasks across 12 disciplines, automatically constructed via a multi-agent system to evaluate deep research agents on authentic, traceable, and challenging research workflows.
A project page or GitHub link, maybe?
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry (2025)
- ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks (2025)
- DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis (2025)
- WideSearch: Benchmarking Agentic Broad Info-Seeking (2025)
- WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent (2025)
- DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery (2025)
- EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation (2025)