arxiv:2506.01062

SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Published on Jun 1 · Submitted by tuvu on Jun 3

Abstract

AI-generated summary: SealQA evaluates search-augmented language models' performance on fact-seeking questions with conflicting or noisy search results, revealing limitations in reasoning and factual accuracy.

We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.
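
Since the benchmark is released on the Hugging Face Hub, it can be pulled with the `datasets` library. Below is a minimal loading sketch; the configuration name `seal_0` and the `test` split are illustrative assumptions, so check the dataset card at huggingface.co/datasets/vtllms/sealqa for the actual configuration and split names.

```python
# Minimal sketch: load one SealQA flavor from the Hugging Face Hub.
# NOTE: the config name "seal_0" and the "test" split are assumptions;
# see the dataset card at huggingface.co/datasets/vtllms/sealqa.
from datasets import load_dataset

seal_0 = load_dataset("vtllms/sealqa", "seal_0", split="test")
print(len(seal_0))   # number of questions in this split
print(seal_0[0])     # inspect the fields of one example
```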

Community

Paper author · Paper submitter

SealQA: A challenge benchmark for retrieval-augmented generation / tool-use LLMs, where questions trigger conflicting, ambiguous, or unhelpful web search results.

Key takeaways:

  • Frontier LLMs struggle on Seal-0 (SealQA's core set), where most chat LLMs (incl. GPT-4.1 w/ browsing) achieve near-zero accuracy.
  • More test-time compute does not yield reliable gains: o-series models often plateau or decline early.
  • Advanced reasoning models (e.g., DeepSeek-R1) can be highly vulnerable to noisy search results.
  • "Lost-in-the-middle" is less of an issue, but models still fail to reliably identify relevant docs amid distractors.
