arxiv:2502.06282

Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE

Published on Feb 10

Submitted by Hhaiduo on Feb 11

Authors:

Haiduo Huang ,

Abstract

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens, which are then verified in parallel by the larger target model. However, the limited capacity of the draft model often necessitates tree-based sampling to improve prediction accuracy, where multiple candidates are generated at each step. We identify a key limitation in this approach: the candidates at the same step are derived from the same representation, limiting diversity and reducing overall effectiveness. To address this, we propose Jakiro, which leverages Mixture of Experts (MoE) so that independent experts generate diverse predictions, effectively decoupling correlations among candidates. Furthermore, we introduce a hybrid inference strategy that combines autoregressive decoding for initial tokens with parallel decoding for subsequent stages, and we enhance the latter with a contrastive mechanism on features to improve accuracy. Our method significantly boosts prediction accuracy and achieves higher inference speedups. Extensive experiments across diverse models validate the effectiveness and robustness of our approach, establishing a new SOTA in speculative decoding. Our code is available at https://github.com/haiduo/Jakiro.
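
For readers new to the draft-then-verify loop that the abstract builds on, the following is a minimal sketch of plain greedy speculative decoding. It is not Jakiro's method: it uses a single chain of draft tokens rather than tree-based sampling, has no MoE heads, and checks drafts by exact match against the target's greedy choice. The toy "models" and all names (`speculative_decode`, `toy_target_next`, `toy_draft_next`, `draft_len`) are assumptions made up for illustration.

```python
from typing import Callable, List, Tuple

# A hypothetical "model" here is just a function: context tokens -> next token.
NextTokenFn = Callable[[List[int]], int]


def speculative_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[int],
    num_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[int], int]:
    """Greedy chain speculative decoding; returns (tokens, verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1) The cheap draft model proposes draft_len tokens autoregressively.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))

        # 2) The target model verifies the proposal. A real system scores all
        #    positions in one batched forward pass; this sketch loops instead
        #    and counts the whole loop as a single verification pass.
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            expected = target_next(tokens + accepted)
            if guess == expected:
                accepted.append(guess)
            else:
                # First mismatch: keep the target's token, discard the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


def toy_target_next(context: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft_next(context: List[int]) -> int:
    """Toy draft model: agrees with the target except right after a 5."""
    return 0 if context[-1] == 5 else (context[-1] + 1) % 10


def greedy_reference(next_fn: NextTokenFn, prompt: List[int], n: int) -> List[int]:
    """Ordinary one-token-at-a-time greedy decoding, for comparison."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_fn(tokens))
    return tokens
```

Because every accepted token is either confirmed or supplied by the target, the output of this sketch matches ordinary greedy decoding with the target alone; the saving comes from needing fewer target passes than generated tokens whenever the draft guesses well.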

Community

Hhaiduo (paper author, paper submitter), 3 days ago

We are thrilled to introduce Jakiro, an innovative approach that significantly enhances speculative decoding (SD) for large language models (LLMs). By leveraging the power of Mixture of Experts (MoE), Jakiro enables diverse predictions from independent experts, effectively addressing a key limitation of traditional tree-based sampling methods.

Key Highlights:
State-of-the-Art Performance: Jakiro achieves unprecedented improvements in prediction accuracy and inference speed.
Universal Compatibility: Works seamlessly with any LLM, including popular models like GPT-4, DeepSpeed, and more.
Room for Optimization: Current results are achieved without additional acceleration techniques (e.g., FlashAttention, vLLM), leaving ample room for further enhancements.
