arxiv:2502.09509

EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling

Published on Feb 13 · Submitted by gkakogeorgiou on Feb 18

Abstract

Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality. By finetuning pre-trained autoencoders with EQ-VAE, we enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA and MaskGIT, achieving a 7× speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning. EQ-VAE is compatible with both continuous and discrete autoencoders, thus offering a versatile enhancement for a wide range of latent generative models. Project page and code: https://eq-vae.github.io/.

Community

Paper author and submitter:
  1. EQ-VAE is a simple yet powerful regularization approach that makes latent representations equivariant to spatial transformations, leading to smoother latents and better generative models.

  2. Why EQ-VAE?
    🔹Smoother latent space ➔ easier to model & better generative performance
    🔹No trade-off in reconstruction quality—rFID improves too!
    🔹Works as a plug-and-play enhancement—no architectural changes needed!

  3. Fine-tuning pre-trained autoencoders with EQ-VAE for just 5 epochs unlocks major speedups:
    ✅ 7× faster training convergence on DiT-XL/2
    ✅ 4× faster training on REPA

  4. The motivation:
    SOTA autoencoders reconstruct images well but fail to maintain equivariance in latent space.
    ✅ If you scale an input image, its reconstruction is fine
    ❌ But if you scale the latent representation directly, the reconstruction degrades significantly.

  5. EQ-VAE fixes this by introducing a simple regularization objective:
    👉 It aligns reconstructions of transformed latents with the corresponding transformed inputs (see the PyTorch sketch after this list).

  6. EQ-VAE provides a plug-and-play enhancement — no architectural changes are needed, working seamlessly with:
    ✅ Continuous autoencoders (SD-VAE, SDXL-VAE, SD3-VAE)
    ✅ Discrete autoencoders (VQ-GAN)

  7. Performance gains across the board:
    ✅ DiT-XL/2: gFID drops from 19.5 → 14.5 at 400K iterations
    ✅ REPA: Training time 4M → 1M iterations (4× speedup)
    ✅ MaskGIT: Training time 300 → 130 epochs (2× speedup)

  8. Why does EQ-VAE help so much?
    We find a strong correlation between latent space complexity and generative performance.
    🔹 EQ-VAE reduces the intrinsic dimension (ID) of the latent manifold (see the ID-probe sketch after this list)
    🔹 This makes the latent space simpler and easier to model
    Unlike other regularization methods, it improves generative performance without hurting reconstruction quality.

  9. How fast does EQ-VAE refine latents?
    We trained DiT-B/2 on the resulting latents at each fine-tuning epoch. Even after just a few epochs, gFID drops significantly—showing how quickly EQ-VAE improves the latent space.
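
For point 5, here is a minimal PyTorch sketch of what such an equivariance regularizer can look like. It assumes a diffusers-style `AutoencoderKL` interface (`encode(...).latent_dist.sample()`, `decode(...).sample`); the transform set, the plain MSE losses, and the weight `lambda_eq` are illustrative assumptions, not the authors' exact training recipe (a real fine-tuning run would likely also keep the usual VAE reconstruction objectives).

```python
# Minimal sketch (not the official implementation) of the EQ-VAE idea from point 5:
# decoding a spatially transformed latent should match the same transform applied
# to the input image.
import random
import torch
import torch.nn.functional as F


def sample_transform():
    """Sample parameters of a random semantic-preserving spatial transform."""
    if random.random() < 0.5:
        return ("scale", random.choice([0.5, 0.75]))  # random downscaling factor
    return ("rot90", random.choice([1, 2, 3]))        # random multiple of a 90° rotation


def apply_transform(t: torch.Tensor, params) -> torch.Tensor:
    """Apply the sampled transform to a (B, C, H, W) tensor (image or latent)."""
    kind, value = params
    if kind == "scale":
        return F.interpolate(t, scale_factor=value, mode="bilinear", align_corners=False)
    return torch.rot90(t, value, dims=(-2, -1))


def eq_vae_loss(vae, x: torch.Tensor, lambda_eq: float = 1.0) -> torch.Tensor:
    """Plain reconstruction term + equivariance regularizer on the latents.

    `vae` is assumed to follow the diffusers AutoencoderKL API.
    """
    z = vae.encode(x).latent_dist.sample()          # image -> latent
    rec_loss = F.mse_loss(vae.decode(z).sample, x)  # standard reconstruction

    params = sample_transform()                     # one shared transform
    z_t = apply_transform(z, params)                # transform the latent ...
    x_t = apply_transform(x, params)                # ... and the input image the same way
    eq_loss = F.mse_loss(vae.decode(z_t).sample, x_t)

    return rec_loss + lambda_eq * eq_loss
```

Point 4's observation corresponds to evaluating the `eq_loss` term on a frozen pre-trained autoencoder: it is large for a standard SD-VAE, and EQ-VAE fine-tuning is what drives it down.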
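
For point 8, one common way to probe latent-manifold complexity is the TwoNN intrinsic-dimension estimator (Facco et al., 2017). Whether this matches the estimator used in the paper is an assumption; the sketch below only illustrates how such an ID measurement can be made.

```python
# Rough sketch of an intrinsic-dimension probe for latent spaces (point 8),
# using the TwoNN maximum-likelihood estimator. The choice of estimator is an
# assumption for illustration, not necessarily the one used in the paper.
import torch


@torch.no_grad()
def twonn_intrinsic_dimension(latents: torch.Tensor) -> float:
    """Estimate the intrinsic dimension of N latent vectors of shape (N, D)."""
    n = latents.shape[0]
    dists = torch.cdist(latents, latents)                # pairwise Euclidean distances
    dists.fill_diagonal_(float("inf"))                   # exclude self-distances
    r, _ = torch.topk(dists, k=2, largest=False, dim=1)  # 1st and 2nd nearest neighbours
    mu = r[:, 1] / r[:, 0].clamp_min(1e-12)              # ratio of the two NN distances
    return (n / torch.log(mu).sum()).item()              # TwoNN MLE of the dimension
```

Comparing this number on latents of the same images before and after EQ-VAE fine-tuning is the kind of measurement behind the ID reduction described above.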
