CascadeV | An Implementation of the Würstchen Architecture for High-Resolution Video Generation

News

[2024.07.17] We release the code and pretrained weights of a DiT-based video VAE, which supports video reconstruction with a high compression factor (1x32x32=1024). The T2V model is still on the way.

Introduction

CascadeV is a video generation pipeline built upon the Würstchen architecture. By using a highly compressed latent representation, we can generate longer videos with higher resolution.
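The compression factor describes how the pixel-space video is mapped into the latent space: 1x32x32 means no temporal downsampling and 32x spatial downsampling along each axis. A minimal illustration of the arithmetic (our own numbers, matching the shapes quoted below):

```python
# Latent shape implied by a 1x32x32 compression factor (illustrative only).
frames, height, width = 8, 1024, 1024   # an 8x1024x1024 input clip
ft, fh, fw = 1, 32, 32                  # temporal x height x width factors, 1*32*32 = 1024
latent_shape = (frames // ft, height // fh, width // fw)
print(latent_shape)                     # (8, 32, 32): 1024x fewer positions per frame
```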

Video VAE

Comparison of Our Cascade Approach with Other VAEs (on Latent Space of Shape 8x32x32)

Video Reconstruction: Original (left) vs. Reconstructed (right) | Click to view the videos

1. Model Architecture

1.1 DiT

We use PixArt-Σ as our base model with the following modifications:

  • Replace the original VAE (of SDXL) with the one from Stable Video Diffusion.
  • Use the semantic compressor from StableCascade to provide the low-resolution latent input.
  • Remove the text encoder and all multi-head cross-attention layers, since we do not use text conditioning.
  • Replace all 2D attention layers with 3D attention. We find that 3D attention outperforms 2+1D attention (i.e. alternating spatial and temporal attention), especially in temporal consistency; see the sketch below the comparison.

Comparison of 2+1D Attention (left) vs. 3D Attention (right)
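To make the distinction concrete, here is a minimal sketch (not the repository's actual code) of how full 3D attention and factorized 2+1D attention group latent tokens into attention sequences. Toy shapes and a stock nn.MultiheadAttention stand in for the real DiT blocks.

```python
import torch
import torch.nn as nn

# Toy latent: (batch, frames, height, width, channels)
B, T, H, W, C = 1, 4, 16, 16, 64
x = torch.randn(B, T, H, W, C)
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)

# Full 3D attention: a single sequence of T*H*W tokens, so every token can
# attend to every other token across both space and time.
seq_3d = x.reshape(B, T * H * W, C)
out_3d, _ = attn(seq_3d, seq_3d, seq_3d)

# 2+1D attention: per-frame spatial attention over H*W tokens ...
seq_s = x.reshape(B * T, H * W, C)
out_s, _ = attn(seq_s, seq_s, seq_s)
# ... alternated with per-position temporal attention over T tokens.
seq_t = out_s.reshape(B, T, H, W, C).permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
out_t, _ = attn(seq_t, seq_t, seq_t)
```

The 3D variant pays quadratic cost in T*H*W tokens in exchange for better temporal consistency, which motivates the grid attention described next.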

1.2 Grid Attention

Using 3D attention requires far more computation than 2D/2+1D attention, especially at higher resolutions. As a compromise, we replace some 3D attention layers with alternating spatial and temporal grid attention.
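The exact grid layout is not spelled out here, so the following is a hedged sketch of one common form of spatial grid attention (strided, MaxViT-style grids) alternated with temporal attention; the repository's real layers may partition tokens differently. `attn` stands for any self-attention callable that maps (batch, sequence, channels) to the same shape.

```python
import torch
import torch.nn as nn

def spatial_grid_attention(x, attn, g=2):
    # x: (B, T, H, W, C). Attend within a dilated HxW grid: tokens sharing the
    # same (h % g, w % g) residue form one sequence of (H//g)*(W//g) tokens,
    # so each call spans the whole frame at a coarser stride.
    B, T, H, W, C = x.shape
    x = x.reshape(B, T, H // g, g, W // g, g, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6)              # (B, T, g, g, H//g, W//g, C)
    x = x.reshape(B * T * g * g, (H // g) * (W // g), C)
    x = attn(x)
    x = x.reshape(B, T, g, g, H // g, W // g, C).permute(0, 1, 4, 2, 5, 3, 6)
    return x.reshape(B, T, H, W, C)

def temporal_attention(x, attn):
    # x: (B, T, H, W, C). Attend along time at every spatial position.
    B, T, H, W, C = x.shape
    x = x.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
    x = attn(x)
    return x.reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)

# Usage with a stand-in attention module (replace with a real attention block).
x = torch.randn(1, 8, 32, 32, 64)
attn = nn.Identity()
y = temporal_attention(spatial_grid_attention(x, attn), attn)
assert y.shape == x.shape
```

Each grid-attention call sees only (H//g)*(W//g) tokens instead of H*W, which keeps memory manageable while still mixing information across the full frame.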

2. Evaluation

Dataset: We perform a quantitative comparison with other baselines on Inter4K, sampling its first 200 videos to build an evaluation set at 1024x1024 resolution and 30 FPS.

Metrics: We use PSNR, SSIM, and LPIPS to evaluate per-frame quality (i.e. the similarity between the original and reconstructed videos), and VBench to evaluate video quality independently of the original.
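For reference, here is a hedged sketch (our own illustration, not the repository's evaluation script) of how per-frame PSNR, SSIM, and LPIPS can be computed with scikit-image and the lpips package; the exact preprocessing used for the numbers below may differ.

```python
import numpy as np
import torch
import lpips                        # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')

def frame_metrics(orig, recon):
    # orig, recon: uint8 numpy arrays of shape (H, W, 3).
    psnr = peak_signal_noise_ratio(orig, recon, data_range=255)
    ssim = structural_similarity(orig, recon, channel_axis=-1, data_range=255)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_t(orig), to_t(recon)).item()
    return psnr, ssim, lp

# Average over all frames of a clip:
# scores = np.mean([frame_metrics(o, r) for o, r in zip(orig_frames, recon_frames)], axis=0)
```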

2.1 PSNR/SSIM/LPIPS

Diffusion-based VAEs (such as StableCascade and our model) perform poorly on reconstruction metrics: they tend to produce videos with more fine-grained detail that is, however, less similar to the original content.

| Model / Compression Factor | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Open-Sora-Plan v1.1 / 4x8x8=256 | 25.7282 | 0.8000 | 0.1030 |
| EasyAnimate v3 / 4x8x8=256 | 28.8666 | 0.8505 | 0.0818 |
| StableCascade / 1x32x32=1024 | 24.3336 | 0.6896 | 0.1395 |
| Ours / 1x32x32=1024 | 23.7320 | 0.6742 | 0.1786 |

2.2 VBench

Our approach achieves performance comparable to previous VAEs in both frame-wise and temporal quality, even with a much larger compression factor.

| Model / Compression Factor | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Imaging Quality | Aesthetic Quality |
|---|---|---|---|---|---|---|
| Open-Sora-Plan v1.1 / 4x8x8=256 | 0.9519 | 0.9618 | 0.9573 | 0.9789 | 0.6791 | 0.5450 |
| EasyAnimate v3 / 4x8x8=256 | 0.9578 | 0.9695 | 0.9615 | 0.9845 | 0.6735 | 0.5535 |
| StableCascade / 1x32x32=1024 | 0.9490 | 0.9517 | 0.9430 | 0.9639 | 0.6811 | 0.5675 |
| Ours / 1x32x32=1024 | 0.9601 | 0.9679 | 0.9626 | 0.9837 | 0.6747 | 0.5579 |

3. Usage

3.1 Installation

We recommend using Conda:

conda create -n cascadev python==3.9.0
conda activate cascadev

Install PixArt-Σ

bash install.sh

3.2 Download Pretrained Weights

bash pretrained/download.sh

3.3 Video Reconstruction

A sample script for video reconstruction with a compression factor of 32:

bash recon.sh

Results of Video Reconstruction: w/o LDM (left) vs. w/ LDM (right)

It takes about 1 minute to reconstruct a video of shape 8x1024x1024 on a single NVIDIA A800.

3.4 Train VAE

  • Replace "video_list" in configs/s1024.effn-f32.py with your own video datasets
  • Then run
bash train_vae.sh
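The exact structure of configs/s1024.effn-f32.py may differ from the snippet below; conceptually, the edit amounts to pointing video_list at your own data (paths here are hypothetical).

```python
# Hypothetical example: substitute the actual paths to your training videos,
# keeping whatever format the config already uses for "video_list".
video_list = [
    "/path/to/your/videos/clip_0001.mp4",
    "/path/to/your/videos/clip_0002.mp4",
]
```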

Acknowledgement
