arxiv:2502.09509

EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling

Published on Feb 13 · Submitted by gkakogeorgiou on Feb 18

Abstract

Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality. By finetuning pre-trained autoencoders with EQ-VAE, we enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA and MaskGIT, achieving a 7× speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning. EQ-VAE is compatible with both continuous and discrete autoencoders, thus offering a versatile enhancement for a wide range of latent generative models. Project page and code: https://eq-vae.github.io/.

Community

Paper author and submitter:
  1. EQ-VAE is a simple yet powerful regularization approach that makes latent representations equivariant to spatial transformations, leading to smoother latents and better generative models.

  2. Why EQ-VAE?
    🔹Smoother latent space ➔ easier to model & better generative performance
    🔹No trade-off in reconstruction quality—rFID improves too!
    🔹Works as a plug-and-play enhancement—no architectural changes needed!

  3. Fine-tuning pre-trained autoencoders with EQ-VAE for just 5 epochs unlocks major speedups:
    ✅ 7× faster training convergence on DiT-XL/2
    ✅ 4× faster training on REPA

  4. The motivation:
    SOTA autoencoders reconstruct images well but fail to maintain equivariance in latent space.
    ✅ If you scale an input image, its reconstruction is fine
    ❌ But if you scale the latent representation directly, the reconstruction degrades significantly.

  5. EQ-VAE fixes this by introducing a simple regularization objective:
    👉 It aligns reconstructions of transformed latents with the corresponding transformed inputs (see the PyTorch sketch after this list).

  6. EQ-VAE provides a plug-and-play enhancement — no architectural changes are needed, working seamlessly with:
    ✅ Continuous autoencoders (SD-VAE, SDXL-VAE, SD3-VAE)
    ✅ Discrete autoencoders (VQ-GAN)

  7. Performance gains across the board:
    ✅ DiT-XL/2: gFID drops from 19.5 → 14.5 at 400K iterations
    ✅ REPA: Training time 4M → 1M iterations (4× speedup)
    ✅ MaskGIT: Training time 300 → 130 epochs (2× speedup)

  8. Why does EQ-VAE help so much?
    We find a strong correlation between latent space complexity and generative performance.
    🔹 EQ-VAE reduces the intrinsic dimension (ID) of the latent manifold (see the ID-probe sketch after this list)
    🔹 This makes the latent space simpler and easier to model
    Unlike other regularization methods, it improves generative performance without hurting reconstruction quality.

  9. How fast does EQ-VAE refine latents?
    We trained DiT-B/2 on the resulting latents at each fine-tuning epoch. Even after just a few epochs, gFID drops significantly—showing how quickly EQ-VAE improves the latent space.
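
For point 5, here is a minimal PyTorch sketch of what such an equivariance regularizer can look like. It assumes a diffusers-style `AutoencoderKL` interface (`encode(...).latent_dist.sample()`, `decode(...).sample`); the transform set, the plain MSE losses, and the weight `lambda_eq` are illustrative assumptions, not the authors' exact training recipe (a real fine-tuning run would likely also keep the usual VAE reconstruction objectives).

```python
# Minimal sketch (not the official implementation) of the EQ-VAE idea from point 5:
# decoding a spatially transformed latent should match the same transform applied
# to the input image.
import random
import torch
import torch.nn.functional as F


def sample_transform():
    """Sample parameters of a random semantic-preserving spatial transform."""
    if random.random() < 0.5:
        return ("scale", random.choice([0.5, 0.75]))  # random downscaling factor
    return ("rot90", random.choice([1, 2, 3]))        # random multiple of a 90° rotation


def apply_transform(t: torch.Tensor, params) -> torch.Tensor:
    """Apply the sampled transform to a (B, C, H, W) tensor (image or latent)."""
    kind, value = params
    if kind == "scale":
        return F.interpolate(t, scale_factor=value, mode="bilinear", align_corners=False)
    return torch.rot90(t, value, dims=(-2, -1))


def eq_vae_loss(vae, x: torch.Tensor, lambda_eq: float = 1.0) -> torch.Tensor:
    """Plain reconstruction term + equivariance regularizer on the latents.

    `vae` is assumed to follow the diffusers AutoencoderKL API.
    """
    z = vae.encode(x).latent_dist.sample()          # image -> latent
    rec_loss = F.mse_loss(vae.decode(z).sample, x)  # standard reconstruction

    params = sample_transform()                     # one shared transform
    z_t = apply_transform(z, params)                # transform the latent ...
    x_t = apply_transform(x, params)                # ... and the input image the same way
    eq_loss = F.mse_loss(vae.decode(z_t).sample, x_t)

    return rec_loss + lambda_eq * eq_loss
```

Point 4's observation corresponds to evaluating the `eq_loss` term on a frozen pre-trained autoencoder: it is large for a standard SD-VAE, and EQ-VAE fine-tuning is what drives it down.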
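
For point 8, one common way to probe latent-manifold complexity is the TwoNN intrinsic-dimension estimator (Facco et al., 2017). Whether this matches the estimator used in the paper is an assumption; the sketch below only illustrates how such an ID measurement can be made.

```python
# Rough sketch of an intrinsic-dimension probe for latent spaces (point 8),
# using the TwoNN maximum-likelihood estimator. The choice of estimator is an
# assumption for illustration, not necessarily the one used in the paper.
import torch


@torch.no_grad()
def twonn_intrinsic_dimension(latents: torch.Tensor) -> float:
    """Estimate the intrinsic dimension of N latent vectors of shape (N, D)."""
    n = latents.shape[0]
    dists = torch.cdist(latents, latents)                # pairwise Euclidean distances
    dists.fill_diagonal_(float("inf"))                   # exclude self-distances
    r, _ = torch.topk(dists, k=2, largest=False, dim=1)  # 1st and 2nd nearest neighbours
    mu = r[:, 1] / r[:, 0].clamp_min(1e-12)              # ratio of the two NN distances
    return (n / torch.log(mu).sum()).item()              # TwoNN MLE of the dimension
```

Comparing this number on latents of the same images before and after EQ-VAE fine-tuning is the kind of measurement behind the ID reduction described above.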
