Abstract
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains over decoding with the target model alone (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios.
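The threshold-based decision loop the abstract describes can be sketched in a few lines. Below is a minimal PyTorch-flavored illustration of the control flow only: the callables `draft_step`, `target_step`, and `reward_of`, the threshold value, and the per-step granularity are all hypothetical assumptions for illustration, not the paper's actual API or acceptance rule.

```python
import torch

def rsd_generate(draft_step, target_step, reward_of, prompt_ids,
                 reward_threshold=0.7, max_steps=64, eos_id=2):
    """Sketch of a reward-guided speculative decoding loop.

    draft_step / target_step: hypothetical callables mapping the current
    token tensor to the next decoding step from the draft / target model.
    reward_of: a process reward model scoring a partial sequence in [0, 1].
    """
    seq = prompt_ids
    for _ in range(max_steps):
        # 1) The cheap draft model proposes the next step.
        candidate = torch.cat([seq, draft_step(seq)], dim=-1)

        # 2) The process reward model scores the intermediate state.
        if reward_of(candidate) >= reward_threshold:
            # 3a) High-reward draft step: accept it and skip the target model.
            seq = candidate
        else:
            # 3b) Low-reward draft step: regenerate this step with the
            #     stronger (and costlier) target model instead.
            seq = torch.cat([seq, target_step(seq)], dim=-1)

        # Stop once an end-of-sequence token appears.
        if seq[0, -1].item() == eos_id:
            break
    return seq
```

The design point this sketch captures is that the target model is invoked only on steps the reward model flags as low quality, which is where the claimed FLOP savings come from.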
Community
A new algorithm for efficient LLM reasoning.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- AdaEAGLE: Optimizing Speculative Decoding via Explicit Modeling of Adaptive Draft Structures (2024)
- Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning (2024)
- Entropy-Regularized Process Reward Model (2024)
- Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving (2024)
- Constrained Decoding with Speculative Lookaheads (2024)
- Dynamic Scaling of Unit Tests for Code Reward Modeling (2025)
- Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference (2024)
Speculative decoding makes LLM inference efficient, but the unbiasedness requirement on its output is a concern. Reward-guided speculative decoding can up-weight an output that may be rare, so the resulting distribution would not match the LLM's own distribution.