T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models
Abstract
Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving self-verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably self-verify their outputs under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 generalizes effectively to both mathematical (MATH500) and multi-domain knowledge-intensive tasks (MMLU-Pro). Our findings highlight the potential of tool integration to substantially improve the self-verification abilities of sLMs.
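As a rough illustration of the idea of delegating calculation checks to a tool, here is a minimal sketch of best-of-N selection with a tool-based filter. It is not the authors' implementation: the claim format (`<arithmetic> = <number>`), the helper names `tool_verify` and `select_best`, and the per-candidate `score` field (standing in for a learned verifier's score) are all assumptions made for this example.

```python
import ast
import operator
import re

# Arithmetic operators the toy "tool" is allowed to evaluate (assumption:
# a real system would call a full code interpreter instead).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

# Matches claims of the form "<arithmetic> = <number>" inside a solution.
_CLAIM = re.compile(r"([0-9][0-9+\-*/(). ]*?)\s*=\s*(-?[0-9]+(?:\.[0-9]+)?)")


def _eval(node):
    """Evaluate a restricted arithmetic AST; reject anything else."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("unsupported expression")


def tool_verify(solution: str) -> bool:
    """Return True if every arithmetic claim in the text recomputes correctly."""
    for expr, claimed in _CLAIM.findall(solution):
        try:
            value = _eval(ast.parse(expr.strip(), mode="eval"))
        except (SyntaxError, ValueError, ZeroDivisionError):
            return False
        if abs(value - float(claimed)) > 1e-9:
            return False
    return True


def select_best(candidates):
    """Drop candidates that fail the tool check, then pick the top score."""
    verified = [c for c in candidates if tool_verify(c["solution"])]
    pool = verified or candidates  # fall back if the filter rejects everything
    return max(pool, key=lambda c: c["score"])["answer"]


# Hypothetical candidates: the higher-scored one contains a calculation error.
candidates = [
    {"solution": "17 * 23 = 381, so the total is 381 + 9 = 390",
     "answer": "390", "score": 0.9},
    {"solution": "17 * 23 = 391, so the total is 391 + 9 = 400",
     "answer": "400", "score": 0.7},
]
print(select_best(candidates))
```

In this sketch the tool check runs before scoring, so a candidate with a wrong calculation is discarded even when the stand-in verifier score prefers it; the small model no longer needs to have the arithmetic memorized.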