MS-LC-EQ-D-VR VAE: another reproduction of EQ-VAE on variable VAEs and then some
Current VAEs present:
- SDXL VAE
- FLUX VAE
EQ-VAE paper: https://arxiv.org/abs/2502.09509
VIVAT paper: https://arxiv.org/pdf/2506.07863v1
Thanks to Kohaku and his reproduction that made me look into this: https://huggingface.co/KBlueLeaf/EQ-SDXL-VAE
Top: image reconstructed by the VAE. Bottom: latent projected with PCA.
The upper one is the original VAE, the bottom one is the EQ-VAE finetuned VAE.
Introduction
Refer to https://huggingface.co/KBlueLeaf/EQ-SDXL-VAE for introduction to EQ-VAE.
This implementation additionally uses some of the fixes proposed in the VIVAT paper, custom in-house regularization techniques, and its own training implementation.
For additional examples and more information refer to: https://arcenciel.io/articles/20 and https://arcenciel.io/models/10994
Visual Examples
Usage
This is a finetuned SDXL VAE, adapted with new regularization and other techniques. You can use it with your existing SDXL model, but images will show noticeable artifacts, particularly oversharpening and ringing.
This VAE is supposed to be used for finetuning; after that, images will look normal. Be aware that compatibility with old, non-EQ VAEs will be lost (their output will become blurry).
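A minimal sketch of swapping the VAE into an SDXL pipeline as a starting point for finetuning, assuming the `diffusers` library and a weights file that has already been downloaded. The helper name `load_sdxl_with_vae` and the `vae_path` argument are hypothetical; this card does not list file names, so check the repository for the actual checkpoint.

```python
def load_sdxl_with_vae(vae_path, base_model="stabilityai/stable-diffusion-xl-base-1.0"):
    """Load an SDXL pipeline with its VAE replaced by the checkpoint at vae_path.

    vae_path is a hypothetical local path to a single-file VAE checkpoint
    (for example a .safetensors file downloaded from the model repository).
    """
    # Imported lazily so this helper can be defined without the heavy dependency.
    from diffusers import AutoencoderKL, StableDiffusionXLPipeline

    # Default (float32) precision is kept to sidestep half-precision VAE issues.
    vae = AutoencoderKL.from_single_file(vae_path)
    pipe = StableDiffusionXLPipeline.from_pretrained(base_model, vae=vae)
    return pipe
```

Until the model itself is finetuned on the new latents, output from such a pipeline is expected to show the oversharpening and ringing described above.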
Training Setup
Base SDXL:
Base Model: SDXL-VAE
Dataset: ~12.8k anime images
Batch Size: 128 (bs 8, grad acc 16)
Samples Seen: ~75k
Loss Weights:
- L1: 0.3
- L2: 0.5
- SSIM: 0.5
- LPIPS: 0.5
- KL: 0.000001
- Consistency Loss: 0.75
Both Encoder and Decoder were trained.
Training Time: ~8-10 hours on 4060Ti
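As a rough illustration of how the loss weights above could enter the objective, here is a sketch that assumes the terms are combined as a plain weighted sum. The card does not state the combination rule, and the per-term values below are placeholders, not measured numbers.

```python
# Loss weights for the base SDXL run, copied from the list above.
SDXL_BASE_WEIGHTS = {
    "l1": 0.3,
    "l2": 0.5,
    "ssim": 0.5,
    "lpips": 0.5,
    "kl": 0.000001,
    "consistency": 0.75,
}


def total_loss(components, weights=SDXL_BASE_WEIGHTS):
    """Weighted sum of per-term loss values (assumed combination rule)."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


# Placeholder per-term values, for illustration only.
example_terms = {
    "l1": 0.05,
    "l2": 0.01,
    "ssim": 0.10,
    "lpips": 0.20,
    "kl": 30.0,
    "consistency": 0.08,
}
print(total_loss(example_terms))
```

With a KL weight of 0.000001, even a KL value in the tens contributes almost nothing to the total, so the reconstruction and consistency terms dominate under this assumption.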
B2:
Base Model: First version
Dataset: 87.8k anime images
Batch Size: 128 (bs 8, grad acc 16)
Samples Seen: ~150k
Loss Weights:
- L1: 0.2
- L2: 0.4
- SSIM: 0.6
- LPIPS: 0.8
- KL: 0.000001
- Consistency Loss: 0.75
Both Encoder and Decoder were trained.
Training Time: ~16 hours on 4060Ti
B3:
Base Model: B2
Dataset: 162.8k anime images
Batch Size: 128 (bs 8, grad acc 16)
Samples Seen: ~225k
Loss Weights:
- L1: 0.2
- L2: 0.4
- SSIM: 0.6
- LPIPS: 0.8
- KL: 0.000001
- Consistency Loss: 0.75
Both Encoder and Decoder were trained.
Training Time: ~24 hours on 4060Ti
B2 is a direct continuation of the base version; the stats shown are cumulative across multiple runs. For B2 I took a new batch of 75k images, so none of the samples seen in that run were repeated.
B3 repeats the B2 procedure on another batch of data and further solidifies cleaner latents. Minor tweaks were made to the training code for better regularization.
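The cumulative figures can be unpacked per stage with a little arithmetic. This sketch treats the approximate ("~") numbers as exact and assumes the listed dataset sizes are cumulative as well, which the card implies but does not state outright.

```python
# Figures taken from the SDXL training setup above.
MICRO_BATCH = 8
GRAD_ACCUM = 16
EFFECTIVE_BATCH = MICRO_BATCH * GRAD_ACCUM  # 128, as listed

# (name, cumulative dataset size, cumulative samples seen)
STAGES = [
    ("base", 12_800, 75_000),
    ("B2", 87_800, 150_000),
    ("B3", 162_800, 225_000),
]


def per_stage_deltas(stages):
    """Return (name, new_images, new_samples, approx_optimizer_steps) per stage."""
    rows = []
    prev_images, prev_samples = 0, 0
    for name, images, samples in stages:
        new_images = images - prev_images
        new_samples = samples - prev_samples
        steps = new_samples // EFFECTIVE_BATCH  # full optimizer steps only
        rows.append((name, new_images, new_samples, steps))
        prev_images, prev_samples = images, samples
    return rows


for row in per_stage_deltas(STAGES):
    print(row)
```

Under these assumptions B2 and B3 each add 75k images and 75k samples, which matches a single pass over new data, while the base run sees its 12.8k images several times over.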
Base FLUX:
Base Model: FLUX-VAE
Dataset: ~12.8k anime images
Batch Size: 128 (bs 8, grad acc 16)
Samples Seen: ~62.5k
Loss Weights:
- L1: 0.3
- L2: 0.4
- SSIM: 0.6
- LPIPS: 0.6
- KL: 0.000001
- Consistency Loss: 0.75
Both Encoder and Decoder were trained.
Training Time: ~6 hours on 4060Ti
Evaluation Results
I'm using a small test set I have on hand, separated into anime (434) and photo (500) images. Additionally, I'm measuring noise in latents. Sorry for the lack of larger test sets.
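The evaluation script is not part of this card. For readers unfamiliar with the reconstruction metrics, here is a small sketch of the conventional PSNR definition, assuming 8-bit pixel values with a peak of 255. How the reported numbers were actually computed and averaged may differ.

```python
import math


def mse(original, reconstructed):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same length")
    if not original:
        raise ValueError("inputs must not be empty")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(peak ** 2 / err)


# Toy 4-pixel example; a real evaluation would run over full images.
print(psnr([10, 20, 30, 40], [12, 18, 33, 40]))
```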
Results on a small benchmark of 500 photos
VAE SDXL | L1 ↓ | L2 ↓ | PSNR ↑ | LPIPS ↓ | MS‑SSIM ↑ | KL ↓ | rFID ↓ |
---|---|---|---|---|---|---|---|
sdxl_vae | 6.282 | 10.534 | 29.278 | 0.063 | 0.947 | 31.216 | 4.819 |
Kohaku EQ-VAE | 6.423 | 10.428 | 29.140 | 0.082 | 0.945 | 43.236 | 6.202 |
Anzhc MS‑LC‑EQ‑D‑VR VAE | 5.975 | 10.096 | 29.526 | 0.106 | 0.952 | 33.176 | 5.578 |
Anzhc MS‑LC‑EQ‑D‑VR VAE B2 | 6.082 | 10.214 | 29.432 | 0.103 | 0.951 | 33.535 | 5.509 |
Anzhc MS‑LC‑EQ‑D‑VR VAE B3 | 6.066 | 10.151 | 29.475 | 0.104 | 0.951 | 34.341 | 5.538 |
VAE FLUX | L1 ↓ | L2 ↓ | PSNR ↑ | LPIPS ↓ | MS‑SSIM ↑ | KL ↓ | rFID ↓ |
---|---|---|---|---|---|---|---|
FLUX VAE | 4.147 | 6.294 | 33.389 | 0.021 | 0.987 | 12.146 | 0.565 |
MS‑LC‑EQ‑D‑VR VAE FLUX | 3.799 | 6.077 | 33.807 | 0.032 | 0.986 | 10.992 | 1.692 |
Noise in latents
VAE SDXL | Noise ↓ |
---|---|
sdxl_vae | 27.508 |
Kohaku EQ-VAE | 17.395 |
Anzhc MS‑LC‑EQ‑D‑VR VAE | 15.527 |
Anzhc MS‑LC‑EQ‑D‑VR VAE B2 | 13.914 |
Anzhc MS‑LC‑EQ‑D‑VR VAE B3 | 13.124 |
VAE FLUX | Noise ↓ |
---|---|
FLUX VAE | 10.499 |
MS‑LC‑EQ‑D‑VR VAE FLUX | 7.635 |
Results on a small benchmark of 434 anime artworks
VAE SDXL | L1 ↓ | L2 ↓ | PSNR ↑ | LPIPS ↓ | MS‑SSIM ↑ | KL ↓ | rFID ↓ |
---|---|---|---|---|---|---|---|
sdxl_vae | 4.369 | 7.905 | 31.080 | 0.038 | 0.969 | 35.057 | 5.088 |
Kohaku EQ-VAE | 4.818 | 8.332 | 30.462 | 0.048 | 0.967 | 50.022 | 7.264 |
Anzhc MS‑LC‑EQ‑D‑VR VAE | 4.351 | 7.902 | 30.956 | 0.062 | 0.970 | 36.724 | 6.239 |
Anzhc MS‑LC‑EQ‑D‑VR VAE B2 | 4.313 | 7.935 | 30.951 | 0.059 | 0.970 | 36.963 | 6.147 |
Anzhc MS‑LC‑EQ‑D‑VR VAE B3 | 4.323 | 7.910 | 30.977 | 0.058 | 0.970 | 37.809 | 6.075 |
VAE FLUX | L1 ↓ | L2 ↓ | PSNR ↑ | LPIPS ↓ | MS‑SSIM ↑ | KL ↓ | rFID ↓ |
---|---|---|---|---|---|---|---|
FLUX VAE | 3.060 | 4.775 | 35.440 | 0.011 | 0.991 | 12.472 | 0.670 |
MS‑LC‑EQ‑D‑VR VAE FLUX | 2.933 | 4.856 | 35.251 | 0.018 | 0.990 | 11.225 | 1.561 |
Noise in latents
VAE SDXL | Noise ↓ |
---|---|
sdxl_vae | 26.359 |
Kohaku EQ-VAE | 17.314 |
Anzhc MS‑LC‑EQ‑D‑VR VAE | 14.976 |
Anzhc MS‑LC‑EQ‑D‑VR VAE B2 | 13.649 |
Anzhc MS‑LC‑EQ‑D‑VR VAE B3 | 13.247 |
VAE FLUX | Noise ↓ |
---|---|
FLUX VAE | 9.913 |
MS‑LC‑EQ‑D‑VR VAE FLUX | 7.723 |
The KL loss suggests that this VAE implementation stays much closer to the original SDXL VAE than Kohaku EQ-VAE does, and will likely be a better candidate for further finetuning, but that is just a theory.
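To make the KL comparison concrete, this short sketch computes the percent change in KL relative to the original `sdxl_vae`, using the values from the SDXL tables above. The dictionary keys are shortened labels chosen for this example.

```python
# KL values copied from the SDXL rows of the two benchmark tables above.
KL = {
    "photo": {
        "sdxl_vae": 31.216,
        "Kohaku EQ-VAE": 43.236,
        "MS-LC-EQ-D-VR": 33.176,
        "MS-LC-EQ-D-VR B2": 33.535,
        "MS-LC-EQ-D-VR B3": 34.341,
    },
    "anime": {
        "sdxl_vae": 35.057,
        "Kohaku EQ-VAE": 50.022,
        "MS-LC-EQ-D-VR": 36.724,
        "MS-LC-EQ-D-VR B2": 36.963,
        "MS-LC-EQ-D-VR B3": 37.809,
    },
}


def relative_change(table, baseline="sdxl_vae"):
    """Percent change of each entry relative to the baseline entry."""
    base = table[baseline]
    return {
        name: 100.0 * (value - base) / base
        for name, value in table.items()
        if name != baseline
    }


for split, table in KL.items():
    for name, pct in relative_change(table).items():
        print(f"{split:5s} {name:18s} {pct:+.1f}%")
```

On both splits the Kohaku EQ-VAE KL sits roughly 40% above the original, while the versions here stay within about 10%.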
B2 further improves latent clarity while maintaining the same or better performance. In particular, it improves handling of very fine textures, which previously would be overcorrected into a smooth surface.
B3 cleans the latents up even more, but at that point they look roughly the same visually.
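The diminishing gains between versions can be read off the latent noise tables. This sketch computes the percent reduction of each SDXL version relative to the previous one; the short version labels are chosen for this example.

```python
# Latent noise values copied from the "Noise in latents" SDXL tables above.
NOISE = {
    "photo": [("sdxl_vae", 27.508), ("base", 15.527), ("B2", 13.914), ("B3", 13.124)],
    "anime": [("sdxl_vae", 26.359), ("base", 14.976), ("B2", 13.649), ("B3", 13.247)],
}


def step_reductions(series):
    """Percent noise reduction of each version relative to the previous one."""
    out = []
    for (_, prev), (name, cur) in zip(series, series[1:]):
        out.append((name, 100.0 * (prev - cur) / prev))
    return out


for split, series in NOISE.items():
    for name, pct in step_reductions(series):
        print(f"{split:5s} {name:4s} -{pct:.1f}% vs previous")
```

The step from B2 to B3 is the smallest on both splits, which is consistent with the two looking nearly identical.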
References
[1] [2502.09509] EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling
[2] [2506.07863] VIVAT: VIRTUOUS IMPROVING VAE TRAINING THROUGH ARTIFACT MITIGATION
[3] sdxl-vae
Cite
@misc{anzhc_ms-lc-eq-d-vr_vae,
author = {Anzhc},
title = {MS-LC-EQ-D-VR VAE: another reproduction of EQ-VAE on variable VAEs and then some},
year = {2025},
howpublished = {Hugging Face model card},
url = {https://huggingface.co/Anzhc/MS-LC-EQ-D-VR_VAE},
note = {Finetuned SDXL-VAE with EQ regularization and more, for improved latent representation.}
}
Acknowledgement
My friend Bluvoll, for no particular reason.