Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
Abstract
LatentSeek enhances LLM reasoning using test-time instance-level adaptation in latent space, improving performance across various benchmarks.
Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.
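To make the core loop described above concrete, here is a minimal, self-contained sketch of the idea: treat the latent representations for a single problem instance as test-time parameters and update them with a REINFORCE-style policy gradient driven by a self-generated reward. The toy decoder, the parity-based reward, and all names below are illustrative assumptions rather than the authors' implementation; in LatentSeek the latents are the hidden states behind an LLM's generated reasoning tokens and the reward is produced by the (frozen) model itself.

```python
# Minimal sketch: test-time policy-gradient search over an instance-level latent.
# Everything here (the toy decoder, the parity reward, the sizes) is an illustrative
# assumption, not the authors' code; only the shape of the update loop is the point.
import torch

torch.manual_seed(0)

VOCAB, HIDDEN, POSITIONS, STEPS, LR = 50, 16, 8, 30, 0.1

# Frozen toy "decoder": maps the latent state at each position to next-token logits.
decoder = torch.nn.Linear(HIDDEN, VOCAB)
decoder.requires_grad_(False)

def self_reward(tokens: torch.Tensor) -> float:
    """Stand-in for the self-generated reward (here: fraction of even token ids).
    In LatentSeek, the frozen LLM itself scores the decoded reasoning chain."""
    return (tokens % 2 == 0).float().mean().item()

# Instance-level "policy parameters": latent representations for this one problem.
latent = torch.randn(POSITIONS, HIDDEN, requires_grad=True)
optimizer = torch.optim.SGD([latent], lr=LR)

for step in range(STEPS):
    logits = decoder(latent)                           # (POSITIONS, VOCAB)
    dist = torch.distributions.Categorical(logits=logits)
    tokens = dist.sample()                             # decode a candidate chain
    reward = self_reward(tokens)                       # scalar, self-generated
    if reward >= 0.99:                                 # stop once the reward is high
        break
    # REINFORCE-style update: raise the log-probability of the sampled tokens
    # in proportion to the reward -- only the latent is updated, never the model.
    loss = -reward * dist.log_prob(tokens).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"reward {reward:.2f} after {step + 1} latent update(s)")
```

Note that only the latent tensor is optimized while the decoder's weights stay frozen, mirroring the test-time-adaptation setting of improving reasoning without any parameter updates to the model.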
Community
Links:
arxiv: https://arxiv.org/pdf/2505.13308v1
project page: https://bigai-nlco.github.io/LatentSeek/
twitter: https://x.com/Hengli_Li_pku/status/1925496962876637329
No training needed, only self-reward! SOTA reasoning performance!
We search in the latent space, also demonstrating the potential of test-time scaling.
Introducing LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model’s latent space.
- Superior Performance on Complex Math Reasoning: LatentSeek consistently outperforms all baselines, achieving an average improvement of 4.73 percentage points over CoT across all model families and prompt configurations.
- Generalizable across backbones: LatentSeek demonstrates superior performance across multiple model families. In terms of model scale, it also consistently outperforms all baseline models across diverse datasets and prompt types.
- Generalizable across prompts: The Qwen2.5 series was explicitly trained using Prompt 1; nevertheless, our method still achieves notable performance gains.
- The large potential of LatentSeek, even when guided by sparse reward: When guided by a perfect sparse reward model (PSRM), LatentSeek achieves an average improvement of 19.12 points over the CoT method and surpasses the self-reward version by an average of 12.57 points.
- Test-time scaling: The ideal reward model yields a consistently improving trend and outperforms the self-reward method across all model backbones, suggesting that test-time scaling can be achieved without requiring a dense reward function.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning (2025)
- TTRL: Test-Time Reinforcement Learning (2025)
- SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning (2025)
- J1: Exploring Simple Test-Time Scaling for LLM-as-a-Judge (2025)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025)
- Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models (2025)
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (2025)