Abstract
Test-time scaling is a promising new approach to language modeling that uses extra compute at inference time to improve performance. OpenAI's o1 model recently demonstrated this capability but did not publicly share its methodology, prompting many replication efforts. We seek the simplest approach that achieves test-time scaling and strong reasoning performance. First, we curate s1K, a small dataset of 1,000 questions paired with reasoning traces, selected using three criteria that we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing, a technique to control test-time compute: either forcefully terminating the model's thinking process, or lengthening it by appending "Wait" to the model's generation whenever it tries to end. The latter can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning Qwen2.5-32B-Instruct on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview by up to 27% on competition math questions (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open source at https://github.com/simplescaling/s1.
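The budget-forcing control loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `mock_generate` stand-in, the `END_THINK` delimiter, and the budget values are all assumptions made here for demonstration; in practice the decoding call would be a real LLM API that stops at an end-of-thinking token.

```python
END_THINK = "<|end_think|>"  # assumed end-of-thinking delimiter (illustrative)

def mock_generate(prompt: str) -> str:
    """Toy stand-in for an LLM decoding call that stops when the
    model tries to end its thinking phase."""
    return "partial reasoning step"

def budget_force(prompt: str, min_extensions: int = 2, max_words: int = 200) -> str:
    """Budget forcing: each time the model tries to end its thinking,
    suppress the stop and append 'Wait' to elicit more reasoning,
    up to `min_extensions` times; terminate early if the word budget
    is exhausted."""
    trace = mock_generate(prompt)
    for _ in range(min_extensions):
        if len(trace.split()) >= max_words:
            break  # budget exhausted: forcefully terminate thinking
        # Suppress the end-of-thinking attempt and nudge the model on.
        trace += " Wait, " + mock_generate(prompt + trace)
    return trace + END_THINK

out = budget_force("Q: What is 2 + 2?")
```

With `min_extensions=2`, the loop appends "Wait" twice before letting the thinking phase close, which is the lengthening behavior the abstract describes; setting a small `max_words` instead exercises the forced-termination path.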
Community
We made a deep-dive video for this paper: https://www.youtube.com/watch?v=JOeqmhLaJmk ("Time = Intelligence?").
The following papers were recommended by the Semantic Scholar API
- HARP: A challenging human-annotated math reasoning benchmark (2024)
- Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models (2025)
- A Survey on LLM Test-Time Compute via Search: Tasks, LLM Profiling, Search Algorithms, and Relevant Frameworks (2025)
- RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques (2025)
- RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? (2025)
- Control LLM: Controlled Evolution for Intelligence Retention in LLM (2025)
- Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning (2024)
Models citing this paper: 5 · Datasets citing this paper: 14 · Spaces citing this paper: 0