arxiv:2503.24235

What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

Published on Mar 31
· Submitted by DonJoey on Apr 1

Abstract

As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing," has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systemic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional roles of individual techniques within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Furthermore, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and more attributions.

Community


This is our latest survey on test-time scaling (TTS), and it differs from recent related surveys in several key aspects:

a. We focus specifically on the TTS strategies themselves, rather than broadly covering reasoning or prompting paradigms.

b. Unlike timeline-based overviews, our survey proposes a unified taxonomy that decomposes existing TTS works along four orthogonal dimensions:

  1. 🧩 What to scale
  2. ⚙️ How to scale
  3. 🌍 Where to scale
  4. 📈 How well to scale

This taxonomy allows researchers to quickly locate, interpret, and apply a given method while making its core contributions and trade-offs immediately clear.

c. Our survey emphasizes practical utility:

  1. We will continuously expand coverage to include how TTS generalizes to diverse downstream tasks, such as agents, safety, and evaluation.
  2. We are also building a growing collection of hands-on guidelines, distilled from the practices and insights of front-line researchers.

