arxiv:2506.09250

Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Published on Jun 10 · Submitted by reach-vb on Jun 13

Abstract

Evaluation artifacts, particularly token limits and impractical instances in benchmarks, lead to misreported failures in Large Reasoning Models on planning puzzles.

AI-generated summary

Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures. Our analysis reveals three critical issues: (1) Tower of Hanoi experiments systematically exceed model output token limits at reported failure points, with models explicitly acknowledging these constraints in their outputs; (2) The authors' automated evaluation framework fails to distinguish between reasoning failures and practical constraints, leading to misclassification of model capabilities; (3) Most concerningly, their River Crossing benchmarks include mathematically impossible instances for N > 5 due to insufficient boat capacity, yet models are scored as failures for not solving these unsolvable problems. When we control for these experimental artifacts, by requesting generating functions instead of exhaustive move lists, preliminary experiments across multiple models indicate high accuracy on Tower of Hanoi instances previously reported as complete failures. These findings highlight the importance of careful experimental design when evaluating AI reasoning capabilities.
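To make the token-budget argument concrete, here is a rough illustrative sketch (not code from the paper): an optimal Tower of Hanoi solution for N disks requires 2^N - 1 moves, so exhaustively listing the moves grows exponentially, while the recursive procedure that generates them stays a few lines long. The per-move token estimate below is an assumption purely for illustration.

```python
# Illustrative sketch (not from the paper): why exhaustive Tower of Hanoi
# move lists blow up while a "generating function" style answer stays small.

def hanoi_moves(n, src="A", dst="C", aux="B"):
    """Yield the optimal move sequence for n disks; its length is 2**n - 1."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, aux, dst)
    yield (src, dst)
    yield from hanoi_moves(n - 1, aux, dst, src)

if __name__ == "__main__":
    for n in (5, 10, 15):
        moves = list(hanoi_moves(n))
        # Rough token estimate, assuming ~5 tokens per printed move.
        print(f"N={n}: {len(moves)} moves, ~{5 * len(moves)} output tokens")
```

At N = 15 the full enumeration is already 32,767 moves, which at even a few tokens per move approaches or exceeds common per-response output limits; this is where the abstract locates the reported failure points, independent of any reasoning ability.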

Community

Paper submitter

Claude against Apple researchers!


Algorithms to solve these problems appear in many fundamental computer science texts; how can you verify that the LRM is actually reasoning through these problems rather than regurgitating memorized solutions?


I disagree. First, a function is a description of a process; it is not the performance of a process. Second, token limits are a limitation of the models, not a limitation of the paper. I'll grant that the experiment could be designed to present the current state and ask for the next move, avoiding that limitation. Nonetheless, the experiment succeeds in testing the ability to carry out logical processes.
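The incremental design suggested above is straightforward to sketch. The following is a hypothetical harness, not anything from the paper; the state encoding, the `ask_model` stub, and the move format are assumptions for illustration. Under this design, an illegal move counts as a genuine reasoning error rather than an output-length artifact.

```python
# Hypothetical evaluation loop: present the current Tower of Hanoi state,
# ask the model for one move, validate it, apply it, repeat.
# `ask_model` is a stand-in for an actual LRM call.

def legal(pegs, move):
    """A move (src, dst) is legal if src is non-empty and its top disk is
    smaller than the top disk on dst (or dst is empty)."""
    src, dst = move
    return bool(pegs[src]) and (not pegs[dst] or pegs[src][-1] < pegs[dst][-1])

def apply_move(pegs, move):
    src, dst = move
    pegs[dst].append(pegs[src].pop())

def ask_model(pegs):
    # Placeholder: a real harness would serialize `pegs` into a prompt and
    # parse the model's reply into a (source, destination) pair.
    raise NotImplementedError

def evaluate(n_disks, max_steps=10_000):
    # Disks are integers 1..n_disks; smaller number = smaller disk.
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    for _ in range(max_steps):
        if len(pegs["C"]) == n_disks:
            return True          # solved
        move = ask_model(pegs)
        if not legal(pegs, move):
            return False         # reasoning error, not a token-limit artifact
        apply_move(pegs, move)
    return False
```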

The model generated code to answer because that code is in its training data ten times over.

