Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Abstract
Evaluation artifacts, particularly token limits and impractical instances in benchmarks, lead to misreported failures in Large Reasoning Models on planning puzzles.
Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures. Our analysis reveals three critical issues: (1) Tower of Hanoi experiments systematically exceed model output token limits at reported failure points, with models explicitly acknowledging these constraints in their outputs; (2) The authors' automated evaluation framework fails to distinguish between reasoning failures and practical constraints, leading to misclassification of model capabilities; (3) Most concerningly, their River Crossing benchmarks include mathematically impossible instances for N > 5 due to insufficient boat capacity, yet models are scored as failures for not solving these unsolvable problems. When we control for these experimental artifacts, by requesting generating functions instead of exhaustive move lists, preliminary experiments across multiple models indicate high accuracy on Tower of Hanoi instances previously reported as complete failures. These findings highlight the importance of careful experimental design when evaluating AI reasoning capabilities.
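To make the token-budget argument concrete, here is a minimal sketch (assuming roughly 10 tokens per printed move and a 64k output-token cap, both illustrative figures rather than measured values) that shows how quickly an exhaustive Tower of Hanoi move list outgrows a model's output window, alongside the kind of compact generating function the abstract proposes as the alternative answer format.

```python
def hanoi_moves(n, source="A", target="C", auxiliary="B"):
    """Recursively yield the (disk, from_peg, to_peg) moves that solve an
    n-disk Tower of Hanoi -- the 'generating function' style of answer:
    a few lines that fully determine all 2**n - 1 moves."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, source, auxiliary, target)
    yield (n, source, target)
    yield from hanoi_moves(n - 1, auxiliary, target, source)


# Illustrative budget check: an exhaustive move list has 2**n - 1 entries,
# so even a modest per-move token cost overruns typical output limits
# long before any reasoning failure could be observed.
TOKENS_PER_MOVE = 10      # assumed average cost of printing one move
OUTPUT_CAP = 64_000       # assumed output-token limit (illustrative)

for n in range(5, 21):
    moves = 2 ** n - 1
    est_tokens = moves * TOKENS_PER_MOVE
    status = "exceeds cap" if est_tokens > OUTPUT_CAP else "fits"
    print(f"n={n:2d}  moves={moves:7d}  ~tokens={est_tokens:8d}  {status}")

# Sanity check: the generator really produces 2**n - 1 moves.
assert sum(1 for _ in hanoi_moves(10)) == 2 ** 10 - 1
```

Under these illustrative assumptions the move-list format crosses the cap around n = 13, while the generating function stays a dozen lines regardless of n, which is why the reformulation decouples solution correctness from output length.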
Community
Algorithms to solve these problems appear in many fundamental computer science texts. How can you verify that the LRM is actually reasoning through these problems rather than regurgitating memorized solutions?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity (2025)
- Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models (2025)
- When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs (2025)
- Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models (2025)
- Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs (2025)
- CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models (2025)
- An Empirical Study of LLM Reasoning Ability Under Strict Output Length Constraint (2025)
I disagree. First, a function is a description of a process, not the performance of a process. Second, token limits are a limitation of the models, not of the paper. I'll grant that the experiment could be designed to present the current state and ask for the next move, avoiding that limitation. Nonetheless, the experiment succeeds in testing the ability to carry out logical processes.
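As a rough illustration of the alternative design mentioned above, the sketch below (hypothetical helper names, not code from either paper) shows how an evaluator could present the current Tower of Hanoi state and score a single proposed next move, so the model performs the process step by step rather than describing it.

```python
from typing import List, Tuple

State = List[List[int]]   # three pegs, each a stack of disk sizes (top = last element)
Move = Tuple[int, int]    # (from_peg_index, to_peg_index)


def apply_move(state: State, move: Move) -> State:
    """Validate and apply one proposed move; raise ValueError if it is illegal."""
    src, dst = move
    if not state[src]:
        raise ValueError("source peg is empty")
    disk = state[src][-1]
    if state[dst] and state[dst][-1] < disk:
        raise ValueError("cannot place a larger disk on a smaller one")
    new_state = [peg.copy() for peg in state]
    new_state[dst].append(new_state[src].pop())
    return new_state


# Example interaction loop (model_next_move is a placeholder for a call that
# shows the model only the current state and asks for the single next move):
# state = [[3, 2, 1], [], []]
# while state != [[], [], [3, 2, 1]]:
#     state = apply_move(state, model_next_move(state))
```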
The model generated code to answer because that code is in its training data ten times over.