When AI Co-Scientists Fail: SPOT, a Benchmark for Automated Verification of Scientific Research
Abstract
Evaluating LLMs on SPOT, an academic manuscript verification dataset, shows poor recall, precision, and reliability, indicating that current models fall far short of replacing human verification in scientific research.
Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work has cast these systems as generative co-authors responsible for crafting hypotheses, synthesizing code, or drafting manuscripts. In this work, we explore a complementary application: using LLMs as verifiers to automate the academic verification of scientific manuscripts. To that end, we introduce SPOT, a dataset of 83 published papers paired with 91 errors significant enough to prompt errata or retraction, cross-validated with actual authors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find that none surpasses 21.1% recall or 6.1% precision (o3 achieves the best scores, with all others near zero). Furthermore, confidence estimates are uniformly low, and across eight independent runs, models rarely rediscover the same errors, undermining their reliability. Finally, qualitative analysis with domain experts reveals that even the strongest models make mistakes resembling student-level misconceptions rooted in misunderstanding. These findings highlight the substantial gap between current LLM capabilities and the requirements for dependable AI-assisted academic verification.
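To make the abstract's metrics concrete, the sketch below shows one way a SPOT-style evaluation could compute recall, precision, and cross-run rediscovery of annotated errors. The data schema, the `match` predicate, and the function names are illustrative assumptions, not the authors' released evaluation code.

```python
# A minimal sketch (assumed schema, not the authors' code) of scoring
# model-flagged issues against SPOT-style ground-truth error annotations.
from collections import defaultdict

def evaluate_run(ground_truth, predictions, match):
    """ground_truth: {paper_id: set of annotated error descriptions}
    predictions:  {paper_id: list of issues flagged by the model in one run}
    match(pred, gold) -> bool: does a flagged issue hit an annotated error?"""
    true_positives, n_predictions, n_gold = 0, 0, 0
    hits = defaultdict(set)  # paper_id -> annotated errors recovered in this run
    for paper_id, gold_errors in ground_truth.items():
        preds = predictions.get(paper_id, [])
        n_predictions += len(preds)
        n_gold += len(gold_errors)
        for pred in preds:
            for gold in gold_errors:
                if match(pred, gold):
                    hits[paper_id].add(gold)
                    true_positives += 1
                    break  # count each prediction at most once
    recall = sum(len(v) for v in hits.values()) / n_gold if n_gold else 0.0
    precision = true_positives / n_predictions if n_predictions else 0.0
    return recall, precision, hits

def rediscovery_rate(per_run_hits, ground_truth, min_runs=2):
    """Fraction of annotated errors recovered in at least `min_runs` of the
    independent runs, a rough proxy for the cross-run reliability issue
    described in the abstract."""
    counts = defaultdict(int)
    for hits in per_run_hits:  # one `hits` dict per independent run
        for paper_id, golds in hits.items():
            for gold in golds:
                counts[(paper_id, gold)] += 1
    total = sum(len(v) for v in ground_truth.values())
    stable = sum(1 for c in counts.values() if c >= min_runs)
    return stable / total if total else 0.0

# Toy example: two annotated errors in one paper, one of them flagged.
gt = {"paper-01": {"wrong sign in Eq. 3", "mislabeled control group"}}
preds = {"paper-01": ["Equation 3 appears to have a sign error"]}
keyword_match = lambda p, g: "sign" in p.lower() and "sign" in g.lower()
recall, precision, hits = evaluate_run(gt, preds, keyword_match)
print(recall, precision)  # 0.5 1.0
```

In the actual benchmark, deciding whether a flagged issue corresponds to an annotated error is the delicate step; the paper relies on author- and annotator-validated ground truth, so the simple keyword `match` above is purely a placeholder.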
Community
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research (2025)
- YourBench: Easy Custom Evaluation Sets for Everyone (2025)
- CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers? (2025)
- ArxivBench: Can LLMs Assist Researchers in Conducting Research? (2025)
- LongCodeBench: Evaluating Coding LLMs at 1M Context Windows (2025)
- LLMs Outperform Experts on Challenging Biology Benchmarks (2025)
- SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models (2025)