microsoft/vidtwin

leo-guo committed on Mar 20

Commit 45440eb (verified) · 1 Parent(s): 7ccc335

Init Readme.md

README.md +102 -3

@@ -1,3 +1,102 @@
----
-license: mit
----

+---
+license: mit
+tags:
+- tokenization
+- video generation
+- vae
+---
+# VidTwin
+Video VAE with Decoupled Structure and Dynamics
+<img src="./assets/vidtwin_demo.png" width="95%" alt="demo" align="center">
+We propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: **Structure latent vectors**, which capture overall content and global movement, and **Dynamics latent vectors**, which represent fine-grained details and rapid movements.
+Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation.
+Resources and technical documentation:
++ [GitHub](https://github.com/microsoft/VidTok/tree/main/vidtwin)
++ [arXiv](https://arxiv.org/pdf/2412.17726)
+## Setup
+1. Our code is based on **VidTok**, so you will need to install the [required packages for VidTok](https://github.com/microsoft/VidTok?tab=readme-ov-file#setup) first. To do so, navigate to the VidTok folder and create the environment using the `environment.yaml` file:
+```bash
+cd VidTok
+# Prepare conda environment
+conda env create -f environment.yaml
+# Activate the environment
+conda activate vidtok
+```
+2. After setting up VidTok, install the additional packages required for the VidTwin model:
+```bash
+pip install transformers
+pip install timm
+pip install flash-attn --no-build-isolation
+```
+## Training
+Please refer to the [paper](https://arxiv.org/pdf/2412.17726) and [code](https://github.com/microsoft/VidTok/tree/main/vidtwin) for detailed training instructions.
+## Inference
+Please refer to the [paper](https://arxiv.org/pdf/2412.17726) and [code](https://github.com/microsoft/VidTok/tree/main/vidtwin) for detailed inference instructions.
+## Intended Uses
+We are sharing our model with the research community to foster further research in this area:
+* Training your own video tokenizers for research purposes.
+* Video tokenization with various compression rates.
+## Downstream Uses
+Our model is designed to accelerate video-centric research and to serve as a building block for the following applications:
+* Video generation on the continuous / discrete latent tokens.
+* World modelling on the continuous / discrete latent tokens.
+* Generative games on the continuous / discrete latent tokens.
+* Video understanding from the latent tokens.
+## Out-of-scope Uses
+Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of video tokenizers (e.g., performance degradation on out-of-domain data) when selecting use cases, and should evaluate and mitigate privacy, safety, and fairness risks before using the model in a specific downstream use case, particularly in high-risk scenarios.
+Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.
+## Risks and Limitations
+Some of the limitations of this model to be aware of include:
+* VidTwin may lose fine details in the reconstructed content.
+* VidTwin inherits any biases, errors, or omissions characteristic of its training data.
+* VidTwin was developed for research and experimental purposes. Further testing and validation are needed before considering its application in commercial or real-world scenarios.
+## Recommendations
+Some recommendations for alleviating potential limitations include:
+* A lower compression rate yields higher reconstruction quality.
+* For domain-specific video tokenization, we suggest fine-tuning the model on videos from that domain.
+## License
+The model is released under the [MIT license](https://github.com/microsoft/VidTok/blob/main/LICENSE).
+## Contact
+We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us at [email protected].
+## BibTeX
+If you find our project helpful to your research, please consider starring this repository🌟 and citing our paper.
+```bibtex
+@article{wang2024vidtwin,
+  title={VidTwin: Video VAE with Decoupled Structure and Dynamics},
+  author={Wang, Yuchi and Guo, Junliang and Xie, Xinyi and He, Tianyu and Sun, Xu and Bian, Jiang},
+  year={2024},
+  journal={arXiv preprint arXiv:2412.17726},
+}
+```
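
The README above reports reconstruction quality as PSNR (28.14 on MCL-JCV) but does not say how the metric was computed. As a rough illustration only, the sketch below implements the standard definition, PSNR = 10 · log10(MAX² / MSE), over flat sequences of pixel values. The function name `psnr`, its parameters, and the 8-bit peak value of 255 are assumptions made for this example; this is not the authors' evaluation code, and their pipeline may differ (for example in value range or per-frame averaging).

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float],
         reconstruction: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences.

    Hypothetical helper for illustration: `max_value` is the assumed peak
    pixel value (255 for 8-bit data), not a setting taken from VidTwin.
    """
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    if len(reference) == 0:
        raise ValueError("inputs must not be empty")

    # Mean squared error over all pixel positions.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)

    # Identical inputs have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [0, 64, 128, 255]
    decoded = [2, 62, 130, 253]  # every pixel is off by 2, so MSE = 4
    print(f"PSNR: {psnr(original, decoded):.2f} dB")
```

Higher values mean the reconstruction is closer to the original. For a video, one common convention is to compute the value per frame and average it, but the README does not state which convention was used.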