Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Abstract
Evaluation artifacts, particularly token limits and impractical instances in benchmarks, lead to misreported failures in Large Reasoning Models on planning puzzles.
Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures. Our analysis reveals three critical issues: (1) Tower of Hanoi experiments systematically exceed model output token limits at reported failure points, with models explicitly acknowledging these constraints in their outputs; (2) The authors' automated evaluation framework fails to distinguish between reasoning failures and practical constraints, leading to misclassification of model capabilities; (3) Most concerningly, their River Crossing benchmarks include mathematically impossible instances for N > 5 due to insufficient boat capacity, yet models are scored as failures for not solving these unsolvable problems. When we control for these experimental artifacts, by requesting generating functions instead of exhaustive move lists, preliminary experiments across multiple models indicate high accuracy on Tower of Hanoi instances previously reported as complete failures. These findings highlight the importance of careful experimental design when evaluating AI reasoning capabilities.
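To make the token-budget argument concrete, here is a minimal sketch (assuming roughly 10 tokens per printed move and a 64k output-token cap, both illustrative figures rather than measured values) that shows how quickly an exhaustive Tower of Hanoi move list outgrows a model's output window, alongside the kind of compact generating function the abstract proposes as the alternative answer format.

```python
def hanoi_moves(n, source="A", target="C", auxiliary="B"):
    """Recursively yield the (disk, from_peg, to_peg) moves that solve an
    n-disk Tower of Hanoi -- the 'generating function' style of answer:
    a few lines that fully determine all 2**n - 1 moves."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, source, auxiliary, target)
    yield (n, source, target)
    yield from hanoi_moves(n - 1, auxiliary, target, source)


# Illustrative budget check: an exhaustive move list has 2**n - 1 entries,
# so even a modest per-move token cost overruns typical output limits
# long before any reasoning failure could be observed.
TOKENS_PER_MOVE = 10      # assumed average cost of printing one move
OUTPUT_CAP = 64_000       # assumed output-token limit (illustrative)

for n in range(5, 21):
    moves = 2 ** n - 1
    est_tokens = moves * TOKENS_PER_MOVE
    status = "exceeds cap" if est_tokens > OUTPUT_CAP else "fits"
    print(f"n={n:2d}  moves={moves:7d}  ~tokens={est_tokens:8d}  {status}")

# Sanity check: the generator really produces 2**n - 1 moves.
assert sum(1 for _ in hanoi_moves(10)) == 2 ** 10 - 1
```

Under these illustrative assumptions the move-list format crosses the cap around n = 13, while the generating function stays a dozen lines regardless of n, which is why the reformulation decouples solution correctness from output length.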
Community
Algorithms to solve these problems appear in many fundamental computer science texts. How can you verify that the LRM is actually reasoning through these problems rather than regurgitating memorized solutions?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity (2025)
- Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models (2025)
- When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs (2025)
- Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models (2025)
- Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs (2025)
- CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models (2025)
- An Empirical Study of LLM Reasoning Ability Under Strict Output Length Constraint (2025)
I disagree. First, a function is a description of a process, not the performance of a process. Second, token limits are a limitation of the models, not of the paper. I'll grant that the experiment could be designed to present the current state and ask for the next move, avoiding that limitation. Nonetheless, the experiment succeeds in testing the ability to carry out logical processes.
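As a rough illustration of the alternative design mentioned above, the sketch below (hypothetical helper names, not code from either paper) shows how an evaluator could present the current Tower of Hanoi state and score a single proposed next move, so the model performs the process step by step rather than describing it.

```python
from typing import List, Tuple

State = List[List[int]]   # three pegs, each a stack of disk sizes (top = last element)
Move = Tuple[int, int]    # (from_peg_index, to_peg_index)


def apply_move(state: State, move: Move) -> State:
    """Validate and apply one proposed move; raise ValueError if it is illegal."""
    src, dst = move
    if not state[src]:
        raise ValueError("source peg is empty")
    disk = state[src][-1]
    if state[dst] and state[dst][-1] < disk:
        raise ValueError("cannot place a larger disk on a smaller one")
    new_state = [peg.copy() for peg in state]
    new_state[dst].append(new_state[src].pop())
    return new_state


# Example interaction loop (model_next_move is a placeholder for a call that
# shows the model only the current state and asks for the single next move):
# state = [[3, 2, 1], [], []]
# while state != [[], [], [3, 2, 1]]:
#     state = apply_move(state, model_next_move(state))
```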
The model generated code to answer because that code is in its training data ten times over.