CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
Abstract
Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which model token-wise relationships, results in significant latency when generating high-resolution images. To address this issue, in this paper we aim at a linear attention mechanism that reduces the complexity of pre-trained DiTs to linear. We begin our exploration with a comprehensive summary of existing efficient attention mechanisms and identify four key factors crucial for successful linearization of pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Based on these insights, we introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token and thus achieves linear complexity. Our experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. Simultaneously, it reduces attention computations by 99.5% and accelerates generation by 6.3× for 8K-resolution images. Furthermore, we investigate favorable properties of the distilled attention layers, such as zero-shot generalization across various models and plugins, and improved support for multi-GPU parallel inference. Models and code are available here: https://github.com/Huage001/CLEAR.
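To make the core idea concrete, below is a minimal PyTorch sketch of convolution-like local attention: each query attends only to the tokens inside a small window around its position on the 2D latent grid, so the cost grows linearly with the number of tokens instead of quadratically. The square window, zero padding at the borders, and all function and tensor names here are illustrative assumptions for this sketch, not the authors' implementation; CLEAR's exact window shape and training recipe are described in the paper.

```python
# Sketch of local-window ("convolution-like") attention on a 2D token grid.
# Each query attends only to its (2*radius+1)^2 neighborhood, so the cost is
# O(N * W) rather than O(N^2). Border padding is left unmasked for brevity.
import torch
import torch.nn.functional as F


def local_window_attention(q, k, v, grid_h, grid_w, radius=2):
    """q, k, v: (batch, heads, grid_h * grid_w, dim), tokens laid out
    row-major on a grid_h x grid_w grid."""
    b, h, n, d = q.shape
    win = 2 * radius + 1

    # Put keys/values back on the 2D grid: (b*h, d, H, W).
    k2 = k.transpose(-2, -1).reshape(b * h, d, grid_h, grid_w)
    v2 = v.transpose(-2, -1).reshape(b * h, d, grid_h, grid_w)

    def neighborhoods(x):
        # im2col-style gather of each token's local window:
        # (b*h, d*win*win, n) -> (b*h, n, win*win, d)
        patches = F.unfold(x, kernel_size=win, padding=radius)
        patches = patches.reshape(b * h, d, win * win, n)
        return patches.permute(0, 3, 2, 1)

    k_loc = neighborhoods(k2)
    v_loc = neighborhoods(v2)
    q_flat = q.reshape(b * h, n, 1, d)

    # Softmax attention restricted to each query's local window.
    attn = (q_flat @ k_loc.transpose(-2, -1)) * d ** -0.5  # (b*h, n, 1, win*win)
    attn = attn.softmax(dim=-1)
    out = attn @ v_loc                                      # (b*h, n, 1, d)
    return out.reshape(b, h, n, d)


# Toy usage: a 32x32 token grid, 8 heads, 64-dim heads.
q = torch.randn(1, 8, 32 * 32, 64)
k, v = torch.randn_like(q), torch.randn_like(q)
out = local_window_attention(q, k, v, grid_h=32, grid_w=32, radius=2)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```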
Community
"Simultaneously, it reduces attention computations by 99.5%"
I feel like you just skipped over that part like it was nothing. 😂
Thanks for the comment 😂
The number is calculated in terms of FLOPs, and the reduction is indeed surprisingly large at higher resolutions like 8K when comparing the linear-complexity model with the original one.
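For intuition on why the saving grows with resolution, here is a rough back-of-envelope sketch: full softmax attention scales with N² token interactions, while a local window scales with N·W, so the relative FLOPs reduction is roughly 1 − W/N. The grid sides and the window size of 300 tokens below are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged back-of-envelope estimate: attention-FLOPs reduction of local
# windowed attention vs. full attention as the latent grid grows.
def attention_flops_reduction(grid_side, window_tokens):
    n = grid_side * grid_side          # total tokens on the latent grid
    full = n * n                       # pairwise interactions, full attention
    local = n * window_tokens          # each query only sees its window
    return 1 - local / full

# Illustrative grid sides (roughly 2K / 4K / 8K latents under an assumed
# 16x VAE downsampling with 2x2 patchification) and a hypothetical window.
for side in (64, 128, 256):
    print(side, f"{attention_flops_reduction(side, window_tokens=300):.1%}")
# 64 92.7% / 128 98.2% / 256 99.5%
```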
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TinyFusion: Diffusion Transformers Learned Shallow (2024)
- Efficient Scaling of Diffusion Transformers for Text-to-Image Generation (2024)
- ASGDiffusion: Parallel High-Resolution Generation with Asynchronous Structure Guidance (2024)
- On the Surprising Effectiveness of Attention Transfer for Vision Transformers (2024)
- Bridging the Divide: Reconsidering Softmax and Linear Attention (2024)
- FlexDiT: Dynamic Token Density Control for Diffusion Transformer (2024)
- ScaleKD: Strong Vision Transformers Could Be Excellent Teachers (2024)