arxiv:2411.18462

Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding

Published on Nov 27, 2024

· Submitted by

Geralt-Targaryen on Nov 28, 2024

Upvote

Authors:

Ziyin Zhang ,

Jiahao Xu ,

Zhiwei He ,

Zhaopeng Tu

Abstract

Speculative Decoding (SD) has become an important technique in accelerating the inference speed of large language models. Conventional SD methods employ a fixed draft length, which ignores the token generation difficulty across tasks. Consequently, in this paper, we address such an issue and introduce SVIP - a difficulty-aware dynamic draft length policy for speculative decoding systems. Based on a theoretical lower bound of draft token acceptance rate and its inference-time approximation, SVIP adaptively determines the lengths of draft sequences based on the entropy of each draft token distribution. Experimental results on mainstream SD benchmarks and frameworks demonstrate the superior performance of SVIP, achieving up to 20\% walltime speedup on SpecBench over baseline SD methods and 60\% speedup on MT-Bench for long-form generation of up to 8K tokens. Moreover, SVIP is totally training-free and compatible with any existing SD methods that generate draft tokens autoregressively. Experimental results also show that SVIP yields consistent walltime improvement on top of GliDe & CaPE and EAGLE-2.

View arXiv page View PDF Add to collection

Community

Geralt-Targaryen

Paper author Paper submitter Nov 28, 2024

Think your LLM inference is slow? Give it an SVIP! With about 10 lines of code, SVIP can boost the speedup ratio of any autoregressive speculative decoding system without any training🤗

librarian-bot

Nov 29, 2024

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 2

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2411.18462 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2411.18462 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.