NSFW‑Wan‑VAE (v1 • vae_s500.pth)

Drop‑in 3‑D causal VAE decoder for Wan 2.1‑based video diffusion pipelines, specialised for anatomically‑realistic adult content.


✨ What is this?

NSFW‑Wan‑VAE is a replacement decoder for the official Wan 2.1 3‑D Variational Auto‑Encoder.
It is designed to be plug‑compatible with any DiT / UNet that was trained on the original Wan VAE latents, while offering:

  • Sharper micro‑texture in explicit regions (skin pores, hair, tattoos, lingerie details, etc.).
  • Higher dynamic range (latents up to σ ≈ 1.4 without colour clip or channel bias).
  • Improved temporal coherence when decoding long video chunks (≤ 81 frames @ 480‑720 p).

TL;DR: Swap this `.pth` in for the stock VAE, and existing NSFW DiTs should render crisper, more consistent anatomy with fewer colour artefacts.
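Given the compression factors above (spatial 8×, temporal 4×, causal chunked decoding), the latent grid for a clip can be computed up front. This is a minimal sketch assuming the usual Wan-style causal layout, where the first frame is encoded alone and every subsequent group of 4 frames shares one latent step; `wan_latent_shape` is a hypothetical helper, not part of any released API.

```python
def wan_latent_shape(frames, height, width, t_down=4, s_down=8):
    """Latent grid (T, H, W) for a Wan 2.1-style causal 3-D VAE.

    Assumes the causal layout: the first frame is encoded on its own,
    then every group of `t_down` frames shares one latent step.
    """
    if (frames - 1) % t_down or height % s_down or width % s_down:
        raise ValueError("frames must be 1 + 4k; resolution must be a multiple of 8")
    return (1 + (frames - 1) // t_down, height // s_down, width // s_down)

# An 81-frame 480x832 clip decodes from a 21x60x104 latent grid:
print(wan_latent_shape(81, 480, 832))  # (21, 60, 104)
```

This is why the frame budget above is quoted as "≤ 81 frames": 81 = 1 + 4 × 20 is the largest count that fits a 21-step latent chunk.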


🏗 Architectural roots & inspirations

| Building block | Reference |
| --- | --- |
| Wan 2.1 causal 3‑D VAE: spatial 8× / temporal 4× compression, decoded in causal chunks | Alibaba Wan T2V 2.1 |
| Decoder‑only fine‑tuning: preserves the latent manifold used by the upstream diffusion model | Stability AI `sd-vae-ft-*` (SD 1.5 / SDXL) |
| σ‑VAE objective: learnable reconstruction variance auto‑balances KL vs. pixel/perceptual loss | Rybkin et al., 2023 |
| High‑σ latent augmentation: exposes the decoder to the variance range produced under CFG sampling | Google Imagen dynamic thresholding; LAION community work |
| Frequency‑domain (DCT) auxiliary loss: enforces high‑frequency fidelity without a GAN | Wang & Kamilov, 2022 |
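The DCT auxiliary loss in the last row can be sketched in a few lines: compare reconstruction and target in the frequency domain, so high‑frequency mismatch is penalised directly without adversarial training. The illustrative version below is 1‑D for clarity (the training loss operates on image blocks); `dct2` and `dct_l1_loss` are hypothetical names, not the repository's actual code.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence (naive O(N^2) reference)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(s * math.sqrt((1.0 if k == 0 else 2.0) / n))
    return out

def dct_l1_loss(recon, target):
    """Mean absolute difference between DCT coefficients: high-frequency
    errors contribute on equal footing with low-frequency ones."""
    return sum(abs(a - b) for a, b in zip(dct2(recon), dct2(target))) / len(recon)
```

Because the transform is orthonormal, a pixel‑space L2 loss alone would weight all frequencies equally too; the point of the DCT term is to add an L1 penalty, which is less forgiving of small but perceptually visible high‑frequency residuals.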

🔬 Training recipe (v1 checkpoint)

| Item | Setting |
| --- | --- |
| Base weights | Wan 2.1 VAE (encoder + decoder) |
| Frozen | Entire encoder (latent geometry unchanged) |
| Trainable | Decoder (all layers) + a single learnable log σ |
| Dataset | 30 k curated 480 p video clips (17 frames each); balanced for gender, skin tone, lighting, and camera motion |
| Augmentations | Horizontal flip; random crop aligned to the 8‑pixel grid; probabilistic σ‑mix (scale 1.3–1.7) |
| Loss mix | Gaussian NLL (σ‑VAE) + 0.6 × LPIPS‑VGG + (0 → 0.1) × DCT |
| Optimiser | AdamW‑8bit, LR 3e‑5, cosine decay, EMA 0.999 |
| Precision | bf16 everywhere; FP32 on `conv_out` |
| Hardware | 4 × A6000 48 GB with gradient checkpointing |
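The loss and augmentation rows above can be sketched in plain Python (the real training code operates on tensors): `gaussian_nll` is the σ‑VAE reconstruction term with one shared learnable log σ, `total_loss` applies the stated weighting, and `sigma_mix` is the probabilistic latent‑scale augmentation. All function names are hypothetical, and the mix probability `p=0.5` and 500‑step DCT ramp are assumptions (the card states neither).

```python
import math
import random

def gaussian_nll(recon, target, log_sigma):
    """Sigma-VAE reconstruction term: Gaussian NLL with one shared log-sigma.
    A larger sigma down-weights the squared error but pays a +log_sigma
    penalty, so the variance self-calibrates against the other loss terms."""
    mse = sum((r - t) ** 2 for r, t in zip(recon, target)) / len(recon)
    return 0.5 * mse * math.exp(-2.0 * log_sigma) + log_sigma + 0.5 * math.log(2.0 * math.pi)

def total_loss(nll, lpips, dct, step, ramp_steps=500):
    """Loss mix from the table: NLL + 0.6*LPIPS + (0 -> 0.1 ramped) * DCT.
    The ramp length is an assumption, not stated in the card."""
    return nll + 0.6 * lpips + 0.1 * min(step / ramp_steps, 1.0) * dct

def sigma_mix(latent, p=0.5, lo=1.3, hi=1.7, rng=random):
    """With probability p (assumed value), rescale the latent to mimic the
    higher-variance inputs the decoder sees under CFG sampling."""
    if rng.random() < p:
        scale = rng.uniform(lo, hi)
        return [z * scale for z in latent]
    return latent
```

Setting the gradient of `gaussian_nll` with respect to log σ to zero gives log σ* = 0.5 · log MSE, which is the self‑balancing behaviour the σ‑VAE objective is chosen for.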

Status: checkpoint `vae_s500.pth` is the first public milestone (≈ 500 training steps).
It is already usable for inference; later checkpoints will push detail and dynamic range further.


💡 Intended use

| ✅ Primary | ⚠️ Considerations |
| --- | --- |
| Pair with a Wan 2.1‑compatible DiT / UNet for text‑to‑video, img2vid, or inpainting tasks that depict explicit adult content. | The model does not alter any content‑filtering logic; end users must follow platform guidelines and local laws. |
| Research on latent‑space editing or LoRA alignment techniques in NSFW domains. | The VAE alone cannot guarantee perfect temporal consistency; for best results, combine it with a video‑aware diffusion scheduler. |

🚧 Current limitations

  • Encoder remains untouched, so it will still blur out‑of‑distribution modalities (e.g. HDR, medical imaging).
  • Trained only at 480 p; decoding at 720 p–1080 p works but may soften fine print and small logos.
  • Not validated on content involving minors or on non‑consensual or otherwise illegal scenes; such use is strictly prohibited by the licence.

🔏 Responsible publication

This repository is released under the CreativeML OpenRAIL‑M licence.
By downloading or fine‑tuning the weights you agree not to:

  1. Generate or distribute sexual content involving minors, or content that is exploitative, non‑consensual, or otherwise illegal.
  2. Violate any applicable jurisdictional laws or platform ToS.
  3. Remove or alter this model card or licence notice.

For questions about usage boundaries, please open an Issue or contact the maintainers privately.


✍️ Citation

If you use NSFW‑Wan‑VAE in academic work, please cite:
