EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling
Abstract
Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality. By finetuning pre-trained autoencoders with EQ-VAE, we enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA and MaskGIT, achieving a 7× speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning. EQ-VAE is compatible with both continuous and discrete autoencoders, thus offering a versatile enhancement for a wide range of latent generative models. Project page and code: https://eq-vae.github.io/.
Community
EQ-VAE is a simple yet powerful regularization approach that makes latent representations equivariant to spatial transformations, leading to smoother latents and better generative models.
Why EQ-VAE?
🔹 Smoother latent space ➔ easier to model & better generative performance
🔹 No trade-off in reconstruction quality: rFID improves too!
🔹 Works as a plug-and-play enhancement: no architectural changes needed!
Fine-tuning pre-trained autoencoders with EQ-VAE for just 5 epochs unlocks major speedups:
✅ 7× faster training convergence on DiT-XL/2
✅ 4× faster training on REPA
The motivation:
SOTA autoencoders reconstruct images well but fail to maintain equivariance in latent space.
✅ If you scale an input image, its reconstruction is fine.
❌ But if you scale the latent representation directly, the reconstruction degrades significantly.
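To see this gap concretely, here is a minimal sketch (assuming a diffusers-style `AutoencoderKL` such as SD-VAE) that decodes a directly rescaled latent and compares it with the reconstruction of the correspondingly rescaled image; the exact probing protocol in the paper may differ.

```python
# Minimal sketch: probing latent-space equivariance of a pre-trained VAE.
# Assumes a diffusers-style AutoencoderKL; the model id and comparison are illustrative.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def equivariance_gap(x, scale=0.5):
    """x: image batch in [-1, 1] with shape (B, 3, H, W), H and W divisible by 16."""
    # Path A: scale the image, then encode/decode -> reconstruction stays good.
    x_s = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
    rec_img_scaled = vae.decode(vae.encode(x_s).latent_dist.mode()).sample

    # Path B: scale the latent directly, then decode -> reconstruction degrades.
    z = vae.encode(x).latent_dist.mode()
    z_s = F.interpolate(z, scale_factor=scale, mode="bilinear", align_corners=False)
    rec_lat_scaled = vae.decode(z_s).sample

    # A large gap between the two paths signals a lack of equivariance.
    return F.mse_loss(rec_lat_scaled, rec_img_scaled)
```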
EQ-VAE fixes this by introducing a simple regularization objective:
👉 It aligns reconstructions of transformed latents with the corresponding transformed inputs.
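In PyTorch-style pseudocode, the extra term looks roughly like the sketch below. The generic `encoder`/`decoder` callables, the choice of transformations, and the weighting `lambda_eq` are illustrative assumptions; in practice the term is simply added to the usual VAE reconstruction/GAN objective during fine-tuning.

```python
# Hedged sketch of the EQ-VAE equivariance term: decode a transformed latent
# and match it against the equally transformed input image.
# `encoder`/`decoder`, the transformation set, and `lambda_eq` are illustrative.
import random
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def random_transform():
    """Sample a semantics-preserving spatial transform (downscale or rotate)."""
    if random.random() < 0.5:
        s = random.choice([0.25, 0.5, 0.75])
        return lambda t: F.interpolate(t, scale_factor=s, mode="bilinear", align_corners=False)
    angle = random.choice([90, 180, 270])
    return lambda t: TF.rotate(t, angle)

def eq_vae_loss(encoder, decoder, x, lambda_eq=1.0):
    z = encoder(x)
    loss_rec = F.mse_loss(decoder(z), x)          # standard reconstruction term

    tau = random_transform()
    rec_tau = decoder(tau(z))                     # decode the *transformed* latent ...
    target = tau(x)                               # ... and align it with the transformed input
    if target.shape[-2:] != rec_tau.shape[-2:]:   # guard against rounding mismatches
        target = F.interpolate(target, size=rec_tau.shape[-2:], mode="bilinear", align_corners=False)
    return loss_rec + lambda_eq * F.mse_loss(rec_tau, target)
```

Because the transformed latent is still decoded by the same decoder, this term nudges the encoder toward a latent space on which scaling and rotation act the same way they act on images.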
EQ-VAE provides a plug-and-play enhancement: no architectural changes are needed, and it works seamlessly with:
✅ Continuous autoencoders (SD-VAE, SDXL-VAE, SD3-VAE)
✅ Discrete autoencoders (VQ-GAN)
Performance gains across the board:
✅ DiT-XL/2: gFID drops from 19.5 → 14.5 at 400K iterations
✅ REPA: Training time 4M → 1M iterations (4× speedup)
✅ MaskGIT: Training time 300 → 130 epochs (2× speedup)
Why does EQ-VAE help so much?
We find a strong correlation between latent space complexity and generative performance.
🔹 EQ-VAE reduces the intrinsic dimension (ID) of the latent manifold (see the estimator sketch below)
🔹 This makes the latent space simpler and easier to model
Unlike other regularization methods, it improves generative performance without hurting reconstruction quality.
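One common way to quantify the complexity claim above is the TwoNN intrinsic-dimension estimator (Facco et al., 2017) applied to flattened latent vectors. Whether the paper uses this exact estimator is an assumption; the sketch below only illustrates the kind of measurement involved.

```python
# Hedged sketch: TwoNN intrinsic-dimension estimate (Facco et al., 2017) on latent vectors.
# The estimator actually used in the paper is not assumed; this only illustrates the idea.
import torch

def twonn_intrinsic_dimension(z: torch.Tensor) -> float:
    """z: (N, D) tensor of flattened latent vectors from N images."""
    dists = torch.cdist(z, z)                     # pairwise Euclidean distances
    dists.fill_diagonal_(float("inf"))            # exclude each point from its own neighbors
    r, _ = torch.topk(dists, k=2, largest=False)  # r[:, 0]: 1st NN distance, r[:, 1]: 2nd NN
    mu = r[:, 1] / r[:, 0]                        # TwoNN ratio for every sample
    # Maximum-likelihood TwoNN estimate: d = N / sum_i log(mu_i)
    return (z.shape[0] / torch.log(mu).sum()).item()

# Usage idea: compare the estimate on latents from a baseline VAE and from the
# same VAE after EQ-VAE fine-tuning; a lower value means a simpler latent manifold.
# latents = vae.encode(images).latent_dist.mode().flatten(1)
# print(twonn_intrinsic_dimension(latents))
```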
How fast does EQ-VAE refine latents?
We trained DiT-B/2 on the resulting latents at each fine-tuning epoch. Even after just a few epochs, gFID drops significantly, showing how quickly EQ-VAE improves the latent space.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Exploring Representation-Aligned Latent Space for Better Generation (2025)
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models (2025)
- Masked Autoencoders Are Effective Tokenizers for Diffusion Models (2025)
- Taming Feed-forward Reconstruction Models as Latent Encoders for 3D Generative Models (2024)
- Geometry-Preserving Encoder/Decoder in Latent Generative Models (2025)
- VidTwin: Video VAE with Decoupled Structure and Dynamics (2024)
- BD-Diff: Generative Diffusion Model for Image Deblurring on Unknown Domains with Blur-Decoupled Learning (2025)