Abstract
Fractured Sampling optimizes inference in large language models by balancing token usage and accuracy through truncated reasoning trajectories.
Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. In particular, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but at substantial token cost, which impedes deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches the accuracy of full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis shows how to allocate computation across these three dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning.
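To make the three axes concrete, here is a minimal, hypothetical sketch of the sampling loop, not the authors' released code: `generate` stands in for any LLM sampling call, token counting is approximated by whitespace splitting, and all parameter names are illustrative.

```python
from typing import Callable, List

def fractured_sampling(
    prompt: str,
    generate: Callable[[str, int], str],      # (prompt, max_tokens) -> text
    n: int = 4,                                # axis 1: number of CoT trajectories
    depths: tuple = (0.25, 0.5, 0.75, 1.0),    # axis 3: truncation depths
    b: int = 2,                                # axis 2: solutions per truncated trace
    cot_budget: int = 2048,
    answer_budget: int = 256,
) -> List[str]:
    """Sample n reasoning trajectories, truncate each at several depths,
    and draw b final answers from every truncated prefix."""
    candidates = []
    for _ in range(n):
        cot = generate(prompt, cot_budget)     # one full (long) CoT trace
        tokens = cot.split()                   # crude token proxy for the sketch
        for frac in depths:
            prefix = " ".join(tokens[: max(1, int(frac * len(tokens)))])
            for _ in range(b):
                # Force an answer directly from the truncated trace.
                ans = generate(prompt + "\n" + prefix + "\nFinal answer:",
                               answer_budget)
                candidates.append(ans)
    return candidates  # aggregate downstream, e.g. majority vote or Pass@k

# Toy usage with a dummy generator (a real setup would call an LLM API):
if __name__ == "__main__":
    dummy = lambda p, m: "step one step two step three answer 42"
    print(len(fractured_sampling("Q: 6*7?", dummy)))  # n * len(depths) * b = 32
```

Setting `depths=(1.0,)` and `b=1` recovers ordinary full-CoT sampling, while `depths=(0.0,)` approximates solution-only sampling, which is the interpolation the abstract describes.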
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Scalable Chain of Thoughts via Elastic Reasoning (2025)
- Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation (2025)
- AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning (2025)
- Hawkeye: Efficient Reasoning with Model Collaboration (2025)
- S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models (2025)
- SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning (2025)
- Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier (2025)