---
license: mit

tags:
- tokenization
- video generation
- vae
---

# VidTwin
Video VAE with Decoupled Structure and Dynamics

<img src="./assets/vidtwin_demo.png" width="95%" alt="demo" align="center">

We propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: **Structure latent vectors**, which capture overall content and global movement, and **Dynamics latent vectors**, which represent fine-grained details and rapid movements.
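
To make the decoupling concrete, the sketch below shows the shape of the idea: one latent stream that is coarse in space and time, and one that is per-frame but very low-dimensional. Every name and shape in it is a hypothetical placeholder for illustration, not the actual VidTwin interface (see the GitHub repository for the real one).

```python
import torch
from dataclasses import dataclass

# Purely illustrative sketch of the two latent streams VidTwin produces per clip.
# All names and shapes below are hypothetical placeholders, not the real API.
@dataclass
class VidTwinLatents:
    z_structure: torch.Tensor  # coarse, temporally/spatially downsampled: content & global motion
    z_dynamics: torch.Tensor   # per-frame, very low-dimensional: fine detail & rapid motion

def encode(video: torch.Tensor) -> VidTwinLatents:
    """video: (B, C, T, H, W) clip in [-1, 1]. Placeholder shapes for illustration only."""
    b, c, t, h, w = video.shape
    return VidTwinLatents(
        z_structure=torch.randn(b, t // 4, h // 32, w // 32, 8),
        z_dynamics=torch.randn(b, t, 8),
    )

# A decoder would consume both streams to reconstruct the clip:
#   reconstruction = decode(latents.z_structure, latents.z_dynamics)  # (B, C, T, H, W)
```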

Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation.
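
For intuition about what a 0.20% compression rate means: the latent representation stores roughly 1/500 as many values as the raw clip. Below is a back-of-the-envelope check using an arbitrary example clip size, not the configuration reported in the paper.

```python
# Back-of-the-envelope check of a 0.20% compression rate.
# The clip size below is an arbitrary example, not the paper's setting.
frames, height, width, channels = 16, 224, 224, 3
pixel_values = frames * height * width * channels   # 2,408,448 values

latent_budget = pixel_values * 0.0020                # 0.20% of the input
print(f"input values : {pixel_values:,}")
print(f"latent budget: {latent_budget:,.0f} values (~1/500 of the input)")
```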

Resources and technical documentation:

+ [GitHub](https://github.com/microsoft/VidTok/tree/main/vidtwin)
+ [arXiv](https://arxiv.org/pdf/2412.17726)

## Setup

1. Our code is based on **VidTok**, so you will need to install the [required packages for VidTok](https://github.com/microsoft/VidTok?tab=readme-ov-file#setup) first. To do so, navigate to the VidTok folder and create the environment using the `environment.yaml` file:

```bash
cd VidTok
# Prepare conda environment
conda env create -f environment.yaml
# Activate the environment
conda activate vidtok
```

2. After setting up VidTok, install the additional packages required for the VidTwin model (a quick import check to verify these installs is sketched after this step):
```bash
pip install transformers
pip install timm
pip install flash-attn --no-build-isolation
```

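As a quick sanity check (not part of the official repository), the snippet below verifies that the extra dependencies import correctly and that CUDA is visible; `flash-attn` in particular only imports when it was built against a matching PyTorch/CUDA toolchain.

```python
# Sanity check for the extra VidTwin dependencies (illustrative, not from the repo).
import torch
import transformers
import timm

print("torch        :", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers :", transformers.__version__)
print("timm         :", timm.__version__)

try:
    import flash_attn
    print("flash-attn   :", flash_attn.__version__)
except ImportError as err:
    print("flash-attn failed to import - check your CUDA/PyTorch build:", err)
```
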
## Training

Please refer to the [paper](https://arxiv.org/pdf/2412.17726) and [code](https://github.com/microsoft/VidTok/tree/main/vidtwin) for detailed training instructions.

## Inference

Please refer to the [paper](https://arxiv.org/pdf/2412.17726) and [code](https://github.com/microsoft/VidTok/tree/main/vidtwin) for detailed inference instructions.

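For a rough feel of what inference looks like, here is a minimal sketch assuming a VidTok-style loading helper. The helper name `load_model_from_config`, the config/checkpoint paths, the clip shape, and the output unpacking are all assumptions made for illustration; follow the repository's actual entry points for real use.

```python
import torch

# Hypothetical sketch of VidTwin inference, assuming a VidTok-style loader.
# The import path, file paths, shapes, and output format below are assumptions.
from scripts.inference_evaluate import load_model_from_config  # assumed VidTok helper

cfg_path = "configs/vidtwin/vidtwin.yaml"    # hypothetical config path
ckpt_path = "checkpoints/vidtwin.ckpt"       # hypothetical checkpoint path

device = torch.device("cuda")
model = load_model_from_config(cfg_path, ckpt_path).to(device).eval()

# Dummy clip: (batch, channels, frames, height, width), values in [-1, 1].
video = (torch.rand(1, 3, 16, 224, 224) * 2 - 1).to(device)

with torch.no_grad():
    outputs = model(video)  # unpack per the actual model signature in the repository
```
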
## Intended Uses

We are sharing our model with the research community to foster further research in this area:
* Training your own video tokenizers for research purposes.
* Video tokenization with various compression rates.

## Downstream Uses

Our model is designed to accelerate video-centric research and to serve as a building block for the following applications:
* Video generation on the continuous / discrete latent tokens.
* World modelling on the continuous / discrete latent tokens.
* Generative games on the continuous / discrete latent tokens.
* Video understanding from the latent tokens.

## Out-of-scope Uses

Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of video tokenizers (e.g., performance degradation on out-of-domain data) as they select use cases, and should evaluate and mitigate for privacy, safety, and fairness before using the model within a specific downstream use case, particularly in high-risk scenarios.

Developers should be aware of and adhere to applicable laws and regulations (including privacy and trade compliance laws) that are relevant to their use case.

## Risks and Limitations

Some of the limitations of this model to be aware of include:
* VidTwin may lose detailed information in the reconstructed content.
* VidTwin inherits any biases, errors, or omissions characteristic of its training data.
* VidTwin was developed for research and experimental purposes. Further testing and validation are needed before considering its application in commercial or real-world scenarios.

## Recommendations

Some recommendations for alleviating potential limitations include:
* A lower compression rate provides higher reconstruction quality.
* For domain-specific video tokenization, we suggest fine-tuning the model on domain-specific videos.

## License

The model is released under the [MIT license](https://github.com/microsoft/VidTok/blob/main/LICENSE).

## Contact

We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us at [email protected].

## BibTeX

If you find our project helpful to your research, please consider starring this repository 🌟 and citing our paper.
```bibtex
@article{wang2024vidtwin,
  title={VidTwin: Video VAE with Decoupled Structure and Dynamics},
  author={Wang, Yuchi and Guo, Junliang and Xie, Xinyi and He, Tianyu and Sun, Xu and Bian, Jiang},
  year={2024},
  journal={arXiv preprint arXiv:2412.17726},
}
```