Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding
Abstract
Speculative decoding (SD) has become an important technique for accelerating the inference of large language models. Conventional SD methods use a fixed draft length, which ignores how token generation difficulty varies across tasks. In this paper, we address this issue and introduce SVIP, a difficulty-aware dynamic draft length policy for speculative decoding systems. Building on a theoretical lower bound of the draft token acceptance rate and its inference-time approximation, SVIP adaptively determines the length of each draft sequence from the entropy of each draft token distribution. Experimental results on mainstream SD benchmarks and frameworks demonstrate the superior performance of SVIP, which achieves up to 20% wall-time speedup on SpecBench over baseline SD methods and 60% speedup on MT-Bench for long-form generation of up to 8K tokens. Moreover, SVIP is entirely training-free and compatible with any existing SD method that generates draft tokens autoregressively. Experimental results also show that SVIP yields consistent wall-time improvement on top of GliDe & CaPE and EAGLE-2.
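The idea of stopping the draft based on entropy can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact stopping criterion, so the plain entropy threshold below, along with the names `draft_with_entropy_stop`, `entropy_threshold`, `max_draft_len`, and `toy_draft_model`, are assumptions made purely for illustration. The draft model is assumed to be any callable that maps a token context to a next-token probability distribution.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def draft_with_entropy_stop(
    next_token_probs: Callable[[List[int]], Sequence[float]],
    prefix: List[int],
    max_draft_len: int = 8,          # hypothetical cap on draft length
    entropy_threshold: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[int]:
    """Greedily draft tokens until the draft distribution becomes too uncertain.

    Assumption: high entropy of the draft distribution is treated as a sign
    that the token is likely to be rejected, so drafting stops before it.
    """
    draft: List[int] = []
    context = list(prefix)
    for _ in range(max_draft_len):
        probs = next_token_probs(context)
        if entropy(probs) > entropy_threshold:
            break  # hand over to the target model for verification
        token = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        draft.append(token)
        context.append(token)
    return draft


def toy_draft_model(context: List[int]) -> Sequence[float]:
    """Hypothetical draft model: confident for 3 positions, then uniform."""
    if len(context) < 3:
        return [0.97, 0.01, 0.01, 0.01]
    return [0.25, 0.25, 0.25, 0.25]


if __name__ == "__main__":
    print(draft_with_entropy_stop(toy_draft_model, prefix=[]))
```

In this sketch the draft length adapts to the context: easy stretches yield long drafts, while an uncertain position ends the draft early, possibly with no tokens drafted at all. In a real system the drafted tokens would then be verified by the target model as in standard speculative decoding.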
Community
Think your LLM inference is slow? Give it an SVIP! With about 10 lines of code, SVIP can boost the speedup ratio of any autoregressive speculative decoding system without any training.