Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models

Authors: Zhening Xing, Gereon Fox, Yanhong Zeng, Xingang Pan, Mohamed Elgharib, Christian Theobalt, Kai Chen † (†: corresponding author)

arXiv | Project Page | GitHub Repo

Key Features

  • Uni-directional Temporal Attention with Warmup Mechanism (see the first sketch after this list)
  • Multi-timestep KV-Cache for Temporal Attention during Inference (see the second sketch after this list)
  • Depth Prior for Better Structure Consistency
  • Compatible with DreamBooth and LoRA for Various Styles
  • TensorRT Supported
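
As a rough illustration of the first feature, the sketch below builds a uni-directional temporal attention mask with a bi-directional warmup block. This is a minimal sketch of the general idea only; the function name `uni_directional_mask` and the parameter `warmup_frames` are illustrative and are not the repository's actual API.

```python
import torch

def uni_directional_mask(num_frames: int, warmup_frames: int) -> torch.Tensor:
    """Boolean temporal-attention mask: True means "may attend".

    Illustrative sketch (not the repo's code): the first `warmup_frames`
    frames attend to each other bi-directionally; every later frame
    attends only to itself, the warmup frames, and earlier frames.
    """
    mask = torch.tril(torch.ones(num_frames, num_frames)).bool()
    mask[:warmup_frames, :warmup_frames] = True  # bi-directional warmup block
    return mask

# Example: 8 frames with a 2-frame warmup.
print(uni_directional_mask(8, 2).int())
```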
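
The multi-timestep KV-cache can be pictured as follows: since stream batching denoises several frames at different timesteps in one batch, keys and values for the temporal attention are cached per denoising timestep, so each incoming frame's K/V is computed once and reused. The class name `MultiTimestepKVCache` and the rolling-window logic below are assumptions for illustration, not the actual implementation.

```python
import torch

class MultiTimestepKVCache:
    """Illustrative cache (not the real API): stores temporal-attention
    keys/values separately for each denoising timestep."""

    def __init__(self, max_frames: int):
        self.max_frames = max_frames  # rolling temporal window size
        self.cache = {}               # timestep -> (keys, values)

    def append(self, timestep: int, k: torch.Tensor, v: torch.Tensor):
        """Add one frame's K/V (shape [1, heads, dim]) and return the window."""
        if timestep in self.cache:
            ks, vs = self.cache[timestep]
            # Keep only the most recent `max_frames` frames per timestep.
            k = torch.cat([ks, k], dim=0)[-self.max_frames:]
            v = torch.cat([vs, v], dim=0)[-self.max_frames:]
        self.cache[timestep] = (k, v)
        return self.cache[timestep]

# Example: cache one frame's K/V at two different denoising timesteps.
cache = MultiTimestepKVCache(max_frames=16)
k = torch.randn(1, 8, 64)
v = torch.randn(1, 8, 64)
cache.append(timestep=981, k=k, v=v)
cache.append(timestep=641, k=k, v=v)
```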

Speed

The speed evaluation was conducted on Ubuntu 20.04.6 LTS with PyTorch 2.2.2, on an RTX 4090 GPU and an Intel(R) Xeon(R) Platinum 8352V CPU. The number of denoising steps was set to 2.

Resolution   TensorRT   FPS
512 x 512    On         16.43
512 x 512    Off         6.91
768 x 512    On         12.15
768 x 512    Off         6.29

Real-Time Video2Video Demo

Human Face (Web Camera Input)

Anime Character (Screen Video Input)

Acknowledgements

The video and image demos in this GitHub repository were generated using LCM-LoRA. The stream batch technique from StreamDiffusion is used for model acceleration. The design of the video diffusion model is adapted from AnimateDiff. We use a third-party MiDaS implementation that supports ONNX export. Our online demo is modified from Real-Time-Latent-Consistency-Model.

BibTeX

If you find our work helpful, please consider citing it:

@article{xing2024live2diff,
  title={Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models},
  author={Zhening Xing and Gereon Fox and Yanhong Zeng and Xingang Pan and Mohamed Elgharib and Christian Theobalt and Kai Chen},
  journal={arXiv preprint arXiv:2407.08701},
  year={2024}
}