Scaling Diffusion Transformers Efficiently via μP
Abstract
Maximal Update Parametrization (μP) is extended to diffusion Transformers, enabling stable hyperparameter transfer from small to large models and sharply reducing tuning costs across architectures and tasks.
Diffusion Transformers have emerged as the foundation of vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization (μP) was proposed for vanilla Transformers; it enables stable HP transfer from small to large language models and dramatically reduces tuning costs. However, it remains unclear whether the μP of vanilla Transformers extends to diffusion Transformers, which differ in both architecture and training objective. In this work, we generalize standard μP to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that the μP of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-α, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing μP methodologies. Leveraging this result, we systematically demonstrate that DiT-μP enjoys robust HP transferability. Notably, DiT-XL-2-μP with a transferred learning rate converges 2.9× faster than the original DiT-XL-2. Finally, we validate the effectiveness of μP on text-to-image generation by scaling PixArt-α from 0.04B to 0.61B parameters and MMDiT from 0.18B to 18B. In both cases, models under μP outperform their respective baselines while requiring only a small tuning budget: 5.5% of one training run for PixArt-α and 3% of the cost incurred by human experts for MMDiT-18B. These results establish μP as a principled and efficient framework for scaling diffusion Transformers.
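To make the HP-transfer claim concrete, below is a minimal sketch of the standard μP width-scaling rules (from Tensor Programs V, Yang et al., 2021) that the abstract says carry over to diffusion Transformers. This is not the authors' code: `ToyBlock`, `mup_param_groups`, and `base_width` are illustrative names, and the block is a toy stand-in for the matrix-like weights of a DiT block.

```python
# Sketch of mu-P width scaling, assuming PyTorch. Illustrative only.
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """A toy stand-in for a DiT block: input-like, hidden, and output-like weights."""

    def __init__(self, width: int, vocab: int = 1000):
        super().__init__()
        self.embed = nn.Linear(vocab, width)    # input-like: init and Adam LR stay O(1) in width
        self.hidden = nn.Linear(width, width)   # hidden (matrix-like): init std 1/sqrt(width)
        self.readout = nn.Linear(width, vocab)  # output-like: logits get a 1/width multiplier
        self.width = width
        # muP initialization for the width-dependent weights.
        nn.init.normal_(self.hidden.weight, std=width ** -0.5)
        nn.init.zeros_(self.readout.weight)     # zero readout init is a common muP choice

    def forward(self, x):
        h = torch.relu(self.hidden(self.embed(x)))
        return self.readout(h) / self.width     # muP's 1/width output multiplier


def mup_param_groups(model: ToyBlock, base_lr: float, base_width: int):
    """Under muP with Adam, the LR of hidden/readout weights shrinks by
    base_width/width, so an LR tuned at base_width transfers to larger widths.
    (Biases are lumped in with weights here for brevity; muP gives
    vector-like parameters an O(1) learning rate.)"""
    ratio = base_width / model.width
    return [
        {"params": model.embed.parameters(), "lr": base_lr},
        {"params": model.hidden.parameters(), "lr": base_lr * ratio},
        {"params": model.readout.parameters(), "lr": base_lr * ratio},
    ]


# HP transfer in miniature: tune base_lr on a small proxy, reuse it at scale.
small = ToyBlock(width=256)
large = ToyBlock(width=4096)
opt = torch.optim.Adam(mup_param_groups(large, base_lr=3e-4, base_width=256))
```

Under these rules, a learning rate tuned on the width-256 proxy is expected to remain near-optimal at width 4096, which is the mechanism behind the paper's low tuning budgets. Full μP also rescales attention logits by 1/d_head instead of 1/sqrt(d_head); that detail is omitted from this sketch.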
Community
Awesome! Scaling Diffusion Transformers to 18B Efficiently via μP!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling (2025)
- DyDiT++: Dynamic Diffusion Transformers for Efficient Visual Generation (2025)
- Don't be lazy: CompleteP enables compute-efficient deep transformers (2025)
- Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis (2025)
- REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers (2025)
- Fast Autoregressive Models for Continuous Latent Generation (2025)
- No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend