# NSFW‑Wan‑VAE (v1 • `vae_s500.pth`)
Drop‑in 3‑D causal VAE decoder for Wan 2.1‑based video diffusion pipelines, specialised for anatomically‑realistic adult content.
## ✨ What is this?

**NSFW‑Wan‑VAE** is a replacement decoder for the official Wan 2.1 3‑D Variational Auto‑Encoder. It is designed to be plug‑compatible with any DiT / UNet trained on the original Wan VAE latents, while offering:
- Sharper micro‑texture in explicit regions (skin pores, hair, tattoos, lingerie details, etc.).
- Higher dynamic range (latents up to σ ≈ 1.4 without colour clip or channel bias).
- Improved temporal coherence when decoding long video chunks (≤ 81 frames @ 480‑720 p).
> **TL;DR** Swap this `.pth` in for the stock VAE and existing NSFW DiTs should render crisper, more consistent anatomy with fewer colour artefacts.
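Swapping the checkpoint in can be sketched roughly as below. This is a minimal sketch, assuming the pipeline's VAE wrapper names its decoder submodule `decoder.` in its state dict; the helper name and key prefix are illustrative, not part of the release, so adapt them to your stack.

```python
import torch

def swap_decoder(vae: torch.nn.Module, checkpoint_path: str) -> torch.nn.Module:
    """Load only decoder weights from the fine-tuned checkpoint, leaving the
    frozen encoder (and hence the latent geometry) untouched."""
    state = torch.load(checkpoint_path, map_location="cpu")
    state = state.get("state_dict", state)  # some checkpoints nest the weights
    decoder_state = {k: v for k, v in state.items() if k.startswith("decoder.")}
    # strict=False: encoder keys are intentionally absent from this load.
    _, unexpected = vae.load_state_dict(decoder_state, strict=False)
    if unexpected:
        raise ValueError(f"keys that match no VAE decoder parameter: {unexpected}")
    return vae
```

Because only `decoder.*` tensors are loaded, the encoder keeps its stock weights, which is exactly what keeps upstream DiT latents compatible.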
## 🏗 Architectural roots & inspirations

| Building block | Reference |
|---|---|
| Wan 2.1 Causal 3‑D VAE – spatial 8× / temporal 4× compression, decoded in causal chunks | Alibaba Wan T2V 2.1 |
| Decoder‑only fine‑tuning – preserves the latent manifold used by the upstream diffusion model | StabilityAI `sd-vae-ft-*` (1.5 / SDXL) |
| σ‑VAE objective – learnable reconstruction variance auto‑balances KL vs. pixel/perceptual loss | Rybkin et al., 2023 |
| High‑σ latent augmentation – exposes the decoder to the variance range produced under CFG sampling | Google Imagen dynamic thresholding; LAION community work |
| Frequency‑domain (DCT) auxiliary loss – enforces high‑frequency fidelity without a GAN | Wang & Kamilov, 2022 |
## 🔬 Training recipe (v1 checkpoint)

| Item | Setting |
|---|---|
| Base weights | Wan 2.1 VAE (encoder + decoder) |
| Frozen | Entire encoder → latent geometry unchanged |
| Trainable | Decoder (all layers) + a single learnable log σ |
| Dataset | 30 k curated 480 p video clips (17 frames each); balanced for gender, skin tone, lighting, camera motion |
| Augmentations | Horizontal flip, random crop to an 8‑pixel grid, probabilistic σ‑mix (scale 1.3–1.7) |
| Loss mix | Gaussian NLL (σ‑VAE) + 0.6 × LPIPS‑VGG + (0 → 0.1) × DCT |
| Optimiser | AdamW‑8bit, LR 3e‑5, cosine decay, EMA 0.999 |
| Precision | bf16 everywhere, FP32 on `conv_out` |
| Hardware | 4 × A6000 48 GB, gradient checkpointing |
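The σ‑VAE and DCT terms in the loss mix above can be sketched as follows. This is a minimal reconstruction under stated assumptions: a single learnable log σ shared across all pixels, an orthonormal 2‑D DCT‑II over square frames, and the LPIPS‑VGG term omitted (it needs the external `lpips` package); the exact normalisation used in training may differ.

```python
import torch

def dct_mat(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix: row k is cos(pi * (j + 0.5) * k / n)."""
    k = torch.arange(n).float()
    m = torch.cos(torch.pi * (k[None, :] + 0.5) * k[:, None] / n) * (2.0 / n) ** 0.5
    m[0] = (1.0 / n) ** 0.5  # DC row gets its own scale so that m @ m.T == I
    return m

def loss_mix(x_hat: torch.Tensor, x: torch.Tensor,
             log_sigma: torch.Tensor, dct_weight: float = 0.1) -> torch.Tensor:
    # Gaussian NLL with one learnable log sigma (sigma-VAE): the model trades
    # off reconstruction sharpness against variance automatically.
    nll = (0.5 * torch.exp(-2.0 * log_sigma) * (x - x_hat).pow(2) + log_sigma).mean()
    # Frequency-domain auxiliary loss: L1 on 2-D DCT coefficients penalises
    # missing high-frequency detail without needing a GAN critic.
    m = dct_mat(x.shape[-1]).to(x)
    dct = lambda t: m @ t @ m.T  # assumes square H == W frames for brevity
    dct_loss = (dct(x_hat) - dct(x)).abs().mean()
    return nll + dct_weight * dct_loss
```

In training, `dct_weight` would be annealed 0 → 0.1 as the table indicates, and the KL term stays untouched since the encoder is frozen.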
**Status:** checkpoint `vae_s500.pth` is the first public milestone (≈ 500 steps). It is already viable for inference; later checkpoints will push the detail and dynamic‑range targets further.
## 💡 Intended use

| ✅ Primary | ⚠️ Considerations |
|---|---|
| Pair with a Wan 2.1‑compatible DiT / UNet for text‑to‑video, img2vid or in‑painting tasks that depict explicit adult content. | The model does not alter any content‑filtering logic; end users must follow platform guidelines and local laws. |
| Research on latent‑space editing or LoRA alignment techniques in NSFW domains. | The VAE alone cannot guarantee perfect temporal consistency; for best results, combine it with a video‑aware diffusion scheduler. |
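The long‑chunk decoding path mentioned earlier (≤ 81 frames) can be sketched as a thin wrapper around whatever decode call your pipeline exposes; `decode_fn` and the default chunk size are assumptions, the latter reflecting Wan 2.1's layout where 81 output frames correspond to 21 latent frames under 4× temporal compression.

```python
import torch

def decode_in_chunks(decode_fn, latents: torch.Tensor, chunk: int = 21) -> torch.Tensor:
    """Decode (B, C, T_lat, H, W) latents in temporal chunks so peak decoder
    memory stays bounded; chunks are processed in order, which lets a causal
    decoder carry its temporal state from one call to the next."""
    outs = [decode_fn(latents[:, :, t0:t0 + chunk])
            for t0 in range(0, latents.shape[2], chunk)]
    return torch.cat(outs, dim=2)
```

Chunked decoding does not by itself fix cross‑chunk drift; that is where the video‑aware scheduler noted in the table matters.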
## 🚧 Current limitations

- The encoder remains untouched – it will still blur out‑of‑distribution modalities (e.g. HDR, medical imaging).
- Trained only up to 480 p; decoding at 720 p–1080 p works but may soften fine print and small logos.
- Not validated on content involving minors or non‑consensual or otherwise illegal scenes – such use is strictly prohibited by the licence.
## 🔏 Responsible publication

This repository is released under the CreativeML OpenRAIL‑M licence. By downloading or fine‑tuning the weights you agree not to:

- Distribute or generate sexual content involving minors, or content that is exploitative, non‑consensual or otherwise illegal.
- Violate any applicable jurisdictional laws or platform ToS.
- Remove or alter this model card or licence notice.

For questions about usage boundaries, please open an Issue or contact the maintainers privately.
## ✍️ Citation

If you use **NSFW‑Wan‑VAE** in academic work, please cite: