Accelerating Diffusion LLMs via Adaptive Parallel Decoding
Abstract
Adaptive parallel decoding (APD) enhances the throughput of diffusion large language models (dLLMs) by dynamically adjusting parallel token generation without significantly diminishing quality.
The generation speed of LLMs is bottlenecked by autoregressive decoding, where tokens are predicted sequentially, one at a time. Diffusion large language models (dLLMs) theoretically allow parallel token generation, but in practice they struggle to match the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method exposes three tunable parameters to flexibly trade off throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
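To make the sampling rule concrete, here is a minimal, illustrative sketch of one APD-style decoding step. This is not the authors' implementation: the function `apd_step`, the helper `ar_log_prob_fn`, the geometric form of the mixture p^(1-w) · q^w, and the threshold `tau` are assumptions chosen for illustration; the paper's exact acceptance rule and the precise roles of its three tunable parameters may differ.

```python
import torch

def apd_step(dllm_marginals: torch.Tensor,  # (K, V): marginals over K masked slots
             ar_log_prob_fn,                # callable(prefix_ids, token) -> log q(token | prefix)
             prefix_ids: list,
             w: float = 0.5,                # assumed mixture weight between dLLM and AR probabilities
             tau: float = 0.1):             # assumed acceptance threshold on the mixed probability
    """Draft tokens from the dLLM marginals in parallel, then accept the
    longest prefix whose multiplicative-mixture probability stays above tau."""
    accepted = []
    for k in range(dllm_marginals.size(0)):
        probs = dllm_marginals[k]
        token = int(torch.multinomial(probs, 1))               # draft from the dLLM marginal
        log_p = torch.log(probs[token])                        # dLLM marginal factor
        log_q = ar_log_prob_fn(prefix_ids + accepted, token)   # small AR model's joint factor
        # Multiplicative (geometric) mixture of the two models: p^(1-w) * q^w.
        mixed = torch.exp((1 - w) * log_p + w * log_q)
        if mixed < tau:
            break                                              # stop accepting; resample next step
        accepted.append(token)
    if not accepted:                                           # always emit >= 1 token for progress
        accepted.append(int(torch.multinomial(dllm_marginals[0], 1)))
    return accepted

# Toy usage: vocabulary of 10, 4 masked slots, uniform stand-in AR model.
V, K = 10, 4
marginals = torch.softmax(torch.randn(K, V), dim=-1)
uniform_ar = lambda prefix, tok: torch.log(torch.tensor(1.0 / V))
print(apd_step(marginals, uniform_ar, prefix_ids=[1, 2, 3]))
```

The inversion relative to standard speculative decoding is visible in the loop: the large dLLM drafts tokens in parallel, while the small autoregressive model contributes the sequential joint factor that gates how many of those drafts are accepted.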
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this one.
The following papers were recommended by the Semantic Scholar API
- Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion (2025)
- DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding (2025)
- Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding (2025)
- D-AR: Diffusion via Autoregressive Models (2025)
- STree: Speculative Tree Decoding for Hybrid State-Space Models (2025)
- PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation (2025)
- PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding (2025)