arxiv:2507.02659

OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

Published on Jul 3 · Submitted by justinyyy on Jul 8
Abstract

Speculative decoding typically relies on a small, efficient draft model that is pretrained or distilled offline for a particular target model series, for instance, Llama or Qwen models. However, in online deployment settings, two major challenges arise: 1) the target model may be incompatible with the draft model; 2) users expect latency to keep improving with continued usage over time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and to adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch between draft and target models, and we further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications, where model cost, efficiency, and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding, and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models, including Vicuna-7B, Qwen2-7B, and Llama3-8B, for speculative decoding, and additionally provides up to a 1.5-2x speedup.
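
To make the core idea concrete, here is a minimal, self-contained Python sketch of cross-vocabulary speculative decoding with an online n-gram cache. It is an illustration only, not the paper's implementation: the deterministic toy "models", the exact-match acceptance rule (standing in for probabilistic verification), and all names are invented for this example.

```python
# Toy setup: the draft and target models tokenize the same text differently.
TEXT = "hello world!"
DRAFT_VOCAB = ["he", "llo", " wor", "ld", "!"]
TARGET_VOCAB = ["hello", " world", "!"]

# Online n-gram cache: maps a tuple of draft tokens to the target token that
# covers the same text span; filled in as verification succeeds.
ngram_cache: dict[tuple[str, ...], str] = {}

def draft_model(prefix: str) -> str:
    """Toy drafter: greedily continues TEXT using the draft vocabulary."""
    rest = TEXT[len(prefix):]
    for tok in DRAFT_VOCAB:
        if rest.startswith(tok):
            return tok
    return ""  # end of sequence

def target_model(prefix: str) -> str:
    """Toy target: the same continuation over the coarser target vocabulary."""
    rest = TEXT[len(prefix):]
    for tok in TARGET_VOCAB:
        if rest.startswith(tok):
            return tok
    return ""  # end of sequence

def speculate(prefix: str, k: int = 4) -> list[str]:
    """Draft up to k tokens autoregressively with the small model."""
    out: list[str] = []
    for _ in range(k):
        tok = draft_model(prefix + "".join(out))
        if not tok:
            break
        out.append(tok)
    return out

def verify_and_update(prefix: str, draft_tokens: list[str]) -> str:
    """Check drafted tokens against the target model one target token at a
    time; when a run of draft tokens exactly matches an accepted target
    token, record that draft n-gram -> target-token mapping in the cache."""
    accepted = ""
    span: list[str] = []
    for tok in draft_tokens:
        span.append(tok)
        tgt = target_model(prefix + accepted)
        if not tgt:
            break
        if "".join(span) == tgt:            # draft n-gram covers one target token
            ngram_cache[tuple(span)] = tgt  # online cache update
            accepted += tgt
            span = []
    return accepted

prefix = ""
while True:
    drafts = speculate(prefix)
    if not drafts:
        break
    accepted = verify_and_update(prefix, drafts)
    if not accepted:                # all drafts rejected: take one target step
        step = target_model(prefix)
        if not step:
            break
        accepted = step
    prefix += accepted

print(prefix)       # -> hello world!
print(ngram_cache)  # -> {('he', 'llo'): 'hello', (' wor', 'ld'): ' world', ('!',): '!'}
```

The cache entries learned here (e.g., ("he", "llo") -> "hello") are what would allow future draft proposals to be translated directly into target-vocabulary tokens despite the mismatched tokenizers.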

Community

Paper author and submitter:

We propose OmniDraft, which enables one draft model to work with any target model (Llama, Qwen, ...) in speculative decoding. To overcome the vocabulary mismatch, we introduce cross-vocabulary speculative decoding and distillation, using an n-gram cache built online. We also leverage adaptive drafting to further improve the acceptance rate and speedup. Overall, we explore the "one drafter for all" paradigm, opening the avenue for more flexible speculative decoding designs.
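
To illustrate the adaptive-drafting component mentioned above, here is a hedged Python sketch that adjusts the speculation length online from the observed acceptance rate. The class name, threshold values, and EMA smoothing are assumptions made for this example; the paper's actual policy may differ.

```python
class AdaptiveDrafter:
    """Tracks how many drafted tokens the target accepts and grows or
    shrinks the next draft length k accordingly. Illustrative only; the
    update rule and thresholds are invented for this sketch."""

    def __init__(self, k_init: int = 4, k_min: int = 1, k_max: int = 16,
                 ema: float = 0.5):
        self.k = k_init
        self.k_min, self.k_max = k_min, k_max
        self.ema = ema                 # smoothing for the running acceptance rate
        self.acceptance_rate = 1.0     # optimistic prior

    def update(self, n_drafted: int, n_accepted: int) -> int:
        """Record one verification round and return the next draft length."""
        rate = n_accepted / max(n_drafted, 1)
        self.acceptance_rate = (self.ema * self.acceptance_rate
                                + (1 - self.ema) * rate)
        if self.acceptance_rate > 0.8:       # drafts mostly accepted: draft longer
            self.k = min(self.k + 1, self.k_max)
        elif self.acceptance_rate < 0.4:     # drafts mostly rejected: draft shorter
            self.k = max(self.k - 1, self.k_min)
        return self.k

drafter = AdaptiveDrafter()
for n_drafted, n_accepted in [(4, 4), (5, 5), (6, 2), (6, 1), (6, 0)]:
    k = drafter.update(n_drafted, n_accepted)
    print(f"acceptance={drafter.acceptance_rate:.2f} -> next k={k}")
```

The intuition is the one the comment describes: when the target model accepts most drafted tokens, longer speculation amortizes more target forward passes; when acceptance drops, shorter drafts waste less work.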
