Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models
Authors: Zhening Xing, Gereon Fox, Yanhong Zeng, Xingang Pan, Mohamed Elgharib, Christian Theobalt, Kai Chen †(†: corresponding author)
Key Features
- Uni-directional Temporal Attention with Warmup Mechanism (see the sketch after this list)
- Multi-timestep KV-Cache for Temporal Attention during Inference
- Depth Prior for Better Structure Consistency
- Compatible with DreamBooth and LoRA for Various Styles
- TensorRT Supported
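To make the uni-directional attention idea concrete, the sketch below builds a causal temporal attention mask in which every frame attends to itself, to earlier frames, and to a fixed set of warmup frames. This is an illustrative sketch only: the helper name `build_unidirectional_mask` and the exact warmup handling are assumptions, not the repository's actual API.

```python
import torch

def build_unidirectional_mask(num_frames: int, num_warmup: int) -> torch.Tensor:
    """Boolean mask where mask[i, j] is True if frame i may attend to frame j."""
    idx = torch.arange(num_frames)
    causal = idx.unsqueeze(1) >= idx.unsqueeze(0)   # each frame sees itself and earlier frames
    warmup = (idx < num_warmup).unsqueeze(0)        # every frame also sees the warmup frames
    return causal | warmup                          # broadcasts to (num_frames, num_frames)

# Example: a 6-frame window with 2 warmup frames
mask = build_unidirectional_mask(6, 2)
attn_bias = torch.zeros(6, 6).masked_fill(~mask, float("-inf"))
# attn_bias would be added to the temporal attention logits before the softmax,
# which is also the access pattern a per-frame KV-cache has to respect at inference time.
```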
Speed Evaluation
The speed evaluation is conducted on Ubuntu 20.04.6 LTS with PyTorch 2.2.2, an RTX 4090 GPU, and an Intel(R) Xeon(R) Platinum 8352V CPU. The number of denoising steps is set to 2.
| Resolution | TensorRT | FPS |
| --- | --- | --- |
| 512 x 512 | On | 16.43 |
| 512 x 512 | Off | 6.91 |
| 768 x 512 | On | 12.15 |
| 768 x 512 | Off | 6.29 |
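For reference, FPS numbers of this kind can be obtained with a simple wall-clock loop around the per-frame pipeline call. The `pipe(frame)` call below is a placeholder for the actual streaming pipeline, not the repository's API; this is a minimal measurement sketch, not the benchmark script used for the table above.

```python
import time
import torch

def measure_fps(pipe, frames, warmup: int = 10) -> float:
    """Rough wall-clock throughput: frames processed per second after a warmup phase."""
    for frame in frames[:warmup]:                  # warm up kernels, caches, allocators
        pipe(frame)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for frame in frames[warmup:]:
        pipe(frame)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (len(frames) - warmup) / (time.perf_counter() - start)
```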
Real-Time Video2Video Demo
- Human Face (Web Camera Input)
- Anime Character (Screen Video Input)
Acknowledgements
The video and image demos in this GitHub repository were generated using LCM-LoRA. The stream batch technique from StreamDiffusion is used for model acceleration. The design of the video diffusion model is adapted from AnimateDiff. We use a third-party MiDaS implementation that supports ONNX export. Our online demo is modified from Real-Time-Latent-Consistency-Model.
BibTeX
If you find this work helpful, please consider citing it:
@article{xing2024live2diff,
  title={Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models},
  author={Zhening Xing and Gereon Fox and Yanhong Zeng and Xingang Pan and Mohamed Elgharib and Christian Theobalt and Kai Chen},
  journal={arXiv preprint arXiv:2407.08701},
  year={2024}
}