Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
Abstract
SVG2 is a training-free framework that enhances video generation efficiency and quality by accurately identifying and processing critical tokens using semantic-aware permutation and dynamic budget control.
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing attention only over critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto-frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which uses k-means to cluster and reorder tokens based on semantic similarity. This ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedups while maintaining PSNRs of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively.
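For intuition, below is a minimal PyTorch sketch of the two ideas the abstract describes: semantic-aware permutation (k-means clustering of key tokens so that semantically similar tokens become contiguous) and top-p dynamic budget control (keeping only the clusters that accumulate a target fraction of attention mass). The single-head tensor shapes, the cluster count, the centroid-based criticality score, and the masked dense attention at the end are illustrative assumptions, not the paper's released kernels; edge cases such as empty clusters are ignored for brevity.

```python
# Illustrative sketch of semantic-aware permutation with top-p budget control.
# Single-head, unbatched, and written for clarity rather than speed; SVG2 itself
# replaces the masked dense attention below with custom sparse kernels that skip
# non-selected clusters entirely.
import torch


def kmeans(x, num_clusters, iters=10):
    """Plain k-means over token features x of shape (N, d)."""
    centroids = x[torch.randperm(x.shape[0])[:num_clusters]].clone()
    for _ in range(iters):
        labels = torch.cdist(x, centroids).argmin(dim=1)   # (N,) nearest centroid
        for c in range(num_clusters):
            members = x[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    return labels, centroids


def semantic_sparse_attention(q, k, v, num_clusters=32, top_p=0.9):
    """q, k, v: (N, d) tensors for one attention head; returns (N, d)."""
    scale = q.shape[-1] ** -0.5

    # (1) Semantic-aware permutation: cluster keys by feature similarity and
    #     reorder them so each cluster occupies a contiguous block.
    labels, centroids = kmeans(k, num_clusters)
    perm = torch.argsort(labels)
    k_perm, v_perm, cluster_of = k[perm], v[perm], labels[perm]

    # (2) Top-p dynamic budget: score clusters by query-to-centroid attention
    #     and keep the smallest set whose probability mass reaches top_p.
    cluster_probs = (q @ centroids.T * scale).softmax(dim=-1)      # (N, C)
    sorted_p, order = cluster_probs.sort(dim=-1, descending=True)
    keep_sorted = (sorted_p.cumsum(dim=-1) - sorted_p) < top_p     # always keeps top-1
    keep = torch.zeros_like(cluster_probs).scatter(1, order, keep_sorted.float()).bool()

    # (3) Sparse attention: attend only to keys whose cluster was selected.
    allow = keep[:, cluster_of]                                    # (N, N) mask
    attn = (q @ k_perm.T * scale).masked_fill(~allow, float("-inf")).softmax(dim=-1)
    return attn @ v_perm                                           # query order is unchanged


# Example: 1,024 tokens with 64-dim heads.
q, k, v = (torch.randn(1024, 64) for _ in range(3))
out = semantic_sparse_attention(q, k, v)
print(out.shape)  # torch.Size([1024, 64])
```

Because keys of the same cluster become contiguous after the permutation, the selected clusters map to dense blocks of memory, which is what lets the real kernels avoid padding and wasted work.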
Community
This paper proposes a sparse attention technique to accelerate the generation process of video diffusion models. On state-of-the-art video generation models such as Wan 2.1 and HunyuanVideo, Sparse VideoGen2 achieves nearly lossless generation quality while delivering a 2.3x speedup on HunyuanVideo and a 1.89x speedup on Wan 2.1.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance (2025)
- VSA: Faster Video Diffusion with Trainable Sparse Attention (2025)
- VORTA: Efficient Video Diffusion via Routing Sparse Attention (2025)
- Training-Free Efficient Video Generation via Dynamic Token Carving (2025)
- HoliTom: Holistic Token Merging for Fast Video Large Language Models (2025)
- RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy (2025)
- MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention (2025)