Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
Abstract
Diffusion transformers have emerged as an alternative to U-Net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension, such as reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling in three stages: 1) low-resolution latent denoising to efficiently capture global semantic structure, 2) region-adaptive upsampling of artifact-prone regions to full resolution, and 3) upsampling of all latents to full resolution for detail refinement. To stabilize generation across resolution transitions, we leverage noise-timestep rescheduling, which adapts the noise level to each resolution. Our method significantly reduces computation while preserving image quality, achieving up to 7.0× speed-up on FLUX and 3.0× on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is complementary to existing temporal accelerations such as caching methods, and can thus be seamlessly integrated to further reduce inference latency without compromising generation quality.
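For intuition, below is a minimal sketch (ours, not the authors' released code) of what noise-timestep rescheduling at a resolution transition could look like. It assumes the rectified-flow parameterization x_t = (1 − t)·x_0 + t·ε used by FLUX and SD3; the function `upsample_and_reschedule` and its exact blending rule are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def upsample_and_reschedule(latent, t_cur, t_new, scale=2):
    """Upsample a partially denoised latent and move it to noise level t_new.

    Assumes x_t = (1 - t) * x_0 + t * eps with t in [0, 1] (t = 1 is pure
    noise) and 0 <= t_cur < t_new < 1. Choosing t_new > t_cur compensates for
    the noise attenuation that spatial interpolation introduces. This is a
    sketch under those assumptions, not the paper's exact rescheduling rule.
    """
    up = F.interpolate(latent, scale_factor=scale, mode="nearest")
    # Solve a * (1 - t_cur) = 1 - t_new so the clean-signal component of the
    # blended latent matches noise level t_new on the same schedule.
    a = (1.0 - t_new) / (1.0 - t_cur)
    # Independent Gaussian noises add in quadrature: the total noise std must
    # equal t_new, and the carried-over noise already contributes a * t_cur.
    b = (t_new ** 2 - (a * t_cur) ** 2) ** 0.5
    return a * up + b * torch.randn_like(up)
```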
Community
Upsample What Matters: Region‑Adaptive Latent Sampling for Accelerated Diffusion Transformers 🚀
Struggling with slow high-res sampling in diffusion models? This new paper proposes a clever solution: meet RALU (Region-Adaptive Latent Upsampling), a training-free method that speeds up generation without sacrificing quality.
🧠 Core Idea:
Instead of applying expensive high-res denoising everywhere, RALU works in 3 stages (see the sketch after this list):
1. Low-res denoising over the full image to get a global structure
2. Selective high-res refinement only where artifacts are likely
3. Final global high-res pass to polish everything up
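To make the control flow concrete, here is a self-contained, hedged sketch of the three-stage loop. `denoise_step`, `select_regions`, and the stage lengths are hypothetical stand-ins; the real method saves compute by running the transformer on fewer full-resolution tokens, whereas this sketch simply freezes unselected positions to convey the idea.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ralu_sample(denoise_step, select_regions, shape_lo, n1=12, n2=8, n3=8):
    """Illustrative three-stage RALU-style sampling loop.

    denoise_step(x, t) -> one solver step of a pretrained diffusion
        transformer (abstracted away here).
    select_regions(x)  -> boolean (B, 1, H, W) mask of artifact-prone
        low-resolution positions (e.g. edges / high-frequency areas).
    Both callables and the naive mask blending are assumptions for
    illustration, not the released implementation.
    """
    ts = torch.linspace(1.0, 0.0, n1 + n2 + n3 + 1)
    x = torch.randn(shape_lo)  # (B, C, H/2, W/2) latent

    # Stage 1: cheap low-resolution denoising locks in global layout.
    for i in range(n1):
        x = denoise_step(x, ts[i])

    # Stage 2: upsample and refine only regions likely to show artifacts;
    # everything else stays frozen in this simplified sketch.
    mask = select_regions(x)
    x = F.interpolate(x, scale_factor=2, mode="nearest")
    mask = F.interpolate(mask.float(), scale_factor=2, mode="nearest").bool()
    # (A real implementation would also re-noise x here to match the
    # schedule; see the noise-timestep rescheduling sketch near the abstract.)
    for i in range(n1, n1 + n2):
        x = torch.where(mask, denoise_step(x, ts[i]), x)

    # Stage 3: all latents are now at full resolution; a final pass
    # refines detail everywhere.
    for i in range(n1 + n2, n1 + n2 + n3):
        x = denoise_step(x, ts[i])
    return x
```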
✨ No retraining needed, and you get faster sampling (up to 7.0× on FLUX) + fewer artifacts. It’s like telling your model: “Focus where it matters!”
🎯 Why it matters:
• Works on state-of-the-art pre-trained diffusion transformers (no fine-tuning!)
• Cuts down computational cost while preserving quality
• Promising for high-res image generation in real-time or resource-constrained settings
https://github.com/ignoww/RALU
The GitHub repository is now live, and the code will be made publicly available soon.
We’d really appreciate your patience, and if you find the project interesting, feel free to give it a ⭐️ on GitHub!
This would be extremely useful for video models like Wan2.1