SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Abstract
SealQA evaluates search-augmented language models' performance on fact-seeking questions with conflicting or noisy search results, revealing limitations in reasoning and factual accuracy.
We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.
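For readers who want to try the benchmark, here is a minimal sketch of loading SealQA from the Hub and scoring predictions with a crude exact-match check. The config name `seal_0`, the `test` split, and the field names `question`/`answer` are assumptions for illustration only; consult the dataset card at huggingface.co/datasets/vtllms/sealqa for the actual schema and the paper for the official scoring protocol.

```python
# Minimal sketch: load SealQA and compute exact-match accuracy.
# Assumptions (not confirmed by the paper): config name "seal_0",
# split "test", and field names "question"/"answer".
from datasets import load_dataset


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a crude exact-match comparison."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions: dict[str, str], examples) -> float:
    """Fraction of examples whose predicted answer matches the gold answer."""
    correct = 0
    for ex in examples:
        pred = predictions.get(ex["question"], "")
        correct += int(normalize(pred) == normalize(ex["answer"]))
    return correct / len(examples)


if __name__ == "__main__":
    seal0 = load_dataset("vtllms/sealqa", "seal_0", split="test")  # assumed config/split
    # `predictions` would map each question to a model's answer string;
    # empty strings here are placeholders for real model outputs.
    predictions = {ex["question"]: "" for ex in seal0}
    print(f"Exact-match accuracy: {exact_match_accuracy(predictions, seal0):.1%}")
```

Note that the paper's reported accuracies may rely on more lenient answer matching or LLM-based judging; exact match is only a stand-in here.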
Community
SealQA: A challenge benchmark for retrieval-augmented generation / tool-use LLMs, where questions trigger conflicting, ambiguous, or unhelpful web search results.
Key takeaways:
- Frontier LLMs struggle on Seal-0 (SealQA's core set), where most chat LLMs (incl. GPT-4.1 w/ browsing) achieve near-zero accuracy.
- More test-time compute does not yield reliable gains: o-series models often plateau or decline early.
- Advanced reasoning models (e.g., DeepSeek-R1) can be highly vulnerable to noisy search results.
- "Lost-in-the-middle" is less of an issue, but models still fail to reliably identify relevant docs amid distractors.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification (2025)
- FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation (2025)
- UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking (2025)
- Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration (2025)
- Scaling Reasoning can Improve Factuality in Large Language Models (2025)
- Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks (2025)
- Resolving Conflicting Evidence in Automated Fact-Checking: A Study on Retrieval-Augmented LLMs (2025)