Update model card for Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation

#2
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +96 -12
README.md CHANGED
@@ -3,6 +3,7 @@ base_model:
 - THUDM/CogVideoX-5b
 language:
 - en
 license: other
 pipeline_tag: text-to-video
 tags:
@@ -10,22 +11,21 @@ tags:
 - video-generation
 - cogvideox
 - alibaba
- library_name: pytorch
 ---

 <div align="center">

 <img src="icon.jpg" width="250"/>

- <h2><center>[🔥CVPR'25]Tora: Trajectory-oriented Diffusion Transformer for Video Generation</h2>

 Zhenghao Zhang\*, Junchao Liao\*, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang

 \* equal contribution
 <br>

- <a href='https://arxiv.org/abs/2407.21705'><img src='https://img.shields.io/badge/ArXiv-2407.21705-red'></a>
- <a href='https://ali-videoai.github.io/tora_video/'><img src='https://img.shields.io/badge/Project-Page-Blue'></a>
 <a href="https://github.com/alibaba/Tora"><img src='https://img.shields.io/badge/Github-Link-orange'></a>
 <a href='https://www.modelscope.cn/studios/xiaoche/Tora'><img src='https://img.shields.io/badge/🤖_ModelScope-ZH_demo-%23654dfc'></a>
 <a href='https://www.modelscope.cn/studios/Alibaba_Research_Intelligence_Computing/Tora_En'><img src='https://img.shields.io/badge/🤖_ModelScope-EN_demo-%23654dfc'></a>
@@ -43,10 +43,12 @@ Zhenghao Zhang\*, Junchao Liao\*, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu

 ## 💡 Abstract

- Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions concurrently for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos following trajectories. Our design aligns seamlessly with DiT’s scalability, allowing precise control of video content’s dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora’s excellence in achieving high motion fidelity, while also meticulously simulating the movement of physical world.

 ## 📣 Updates

 - `2025/01/06` 🔥🔥We released Tora Image-to-Video, including inference code and model weights.
 - `2024/12/13` SageAttention2 and model compilation are supported in diffusers version. Tested on the A10, these approaches speed up every inference step by approximately 52%, except for the first step.
 - `2024/12/09` 🔥🔥Diffusers version of Tora and the corresponding model weights are released. Inference VRAM requirements are reduced to around 5 GiB. Please refer to [this](diffusers-version/README.md) for details.
@@ -156,6 +158,91 @@ git clone https://www.modelscope.cn/xiaoche/Tora.git
 - T5: [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder), [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer)
 - Tora t2v model weights: [Link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/mp_rank_00_model_states.pt). Downloading this weight requires following the [CogVideoX License](CogVideoX_LICENSE).
 
 ## 🤝 Acknowledgements

 We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:
@@ -174,13 +261,10 @@ Special thanks to the contributors of these libraries for their hard work and de
 ## 📚 Citation

 ```bibtex
- @misc{zhang2024toratrajectoryorienteddiffusiontransformer,
- title={Tora: Trajectory-oriented Diffusion Transformer for Video Generation},
 author={Zhenghao Zhang and Junchao Liao and Menghao Li and Zuozhuo Dai and Bingxue Qiu and Siyu Zhu and Long Qin and Weizhi Wang},
- year={2024},
- eprint={2407.21705},
- archivePrefix={arXiv},
- primaryClass={cs.CV},
- url={https://arxiv.org/abs/2407.21705},
 }
 ```
 
 - THUDM/CogVideoX-5b
 language:
 - en
+ library_name: diffusers
 license: other
 pipeline_tag: text-to-video
 tags:
 
 - video-generation
 - cogvideox
 - alibaba
 ---

 <div align="center">

 <img src="icon.jpg" width="250"/>

+ <h2><center>[🔥ACM MM'25]Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation</h2>

 Zhenghao Zhang\*, Junchao Liao\*, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang

 \* equal contribution
 <br>

+ <a href='https://huggingface.co/papers/2507.05963'><img src='https://img.shields.io/badge/Paper-Tora2-red'></a>
+ <a href='https://ali-videoai.github.io/Tora2_page/'><img src='https://img.shields.io/badge/Project-Page-Blue'></a>
 <a href="https://github.com/alibaba/Tora"><img src='https://img.shields.io/badge/Github-Link-orange'></a>
 <a href='https://www.modelscope.cn/studios/xiaoche/Tora'><img src='https://img.shields.io/badge/🤖_ModelScope-ZH_demo-%23654dfc'></a>
 <a href='https://www.modelscope.cn/studios/Alibaba_Research_Intelligence_Computing/Tora_En'><img src='https://img.shields.io/badge/🤖_ModelScope-EN_demo-%23654dfc'></a>
 
 ## 💡 Abstract

+ Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora, which introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details compared to previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This innovation significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through explicit mapping between motion and personalization embeddings. Tora2 is, to our best knowledge, the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation. Experimental results demonstrate that Tora2 achieves competitive performance with state-of-the-art customization methods while providing advanced motion control capabilities, which marks a critical advancement in multi-condition video generation. Project page: https://ali-videoai.github.io/Tora2_page/

 ## 📣 Updates

+ - `2025/07/08` 🔥🔥 Our latest work, [Tora2](https://ali-videoai.github.io/Tora2_page/), has been accepted by ACM MM 2025. Tora2 builds on Tora with design improvements, enabling enhanced appearance and motion customization for multiple entities.
+ - `2025/05/24` We open-sourced a LoRA-finetuned model of [Wan](https://github.com/Wan-Video/Wan2.1). It turns things in the image into fluffy toys. Check this out: https://github.com/alibaba/wan-toy-transform
 - `2025/01/06` 🔥🔥We released Tora Image-to-Video, including inference code and model weights.
 - `2024/12/13` SageAttention2 and model compilation are supported in diffusers version. Tested on the A10, these approaches speed up every inference step by approximately 52%, except for the first step.
 - `2024/12/09` 🔥🔥Diffusers version of Tora and the corresponding model weights are released. Inference VRAM requirements are reduced to around 5 GiB. Please refer to [this](diffusers-version/README.md) for details.
 
 - T5: [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder), [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer)
 - Tora t2v model weights: [Link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/mp_rank_00_model_states.pt). Downloading this weight requires following the [CogVideoX License](CogVideoX_LICENSE).

+ ## 🔄 Inference
+
+ ### Text to Video
+
+ It requires around 30 GiB of GPU memory (tested on an NVIDIA A100).
+
+ ```bash
+ cd sat
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU sample_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/inference_sparse.yaml --load ckpts/tora/t2v --output-dir samples --point_path trajs/coaster.txt --input-file assets/text/t2v/examples.txt
+ ```
+
+ You can change `--input-file` and `--point_path` to your own prompt and trajectory point files. Please note that the trajectory is drawn on a 256x256 canvas; see the sketch below for one way to prepare a custom trajectory file.
+
+ Replace `$N_GPU` with the number of GPUs you want to use.
+
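+ As a starting point for your own trajectory, copy and edit one of the bundled trajectory files rather than writing one from scratch. The commands below are a minimal sketch (the new file name is just an example); the exact point format inside these files is not documented here, so treat the files shipped in `trajs/` as the authoritative reference.
+
+ ```bash
+ # Inspect a bundled trajectory file to see the exact format expected by sample_video.py.
+ head trajs/coaster.txt
+
+ # Derive a custom trajectory from an existing one. All points are interpreted
+ # on a 256x256 canvas, so keep coordinates within that range.
+ cp trajs/coaster.txt trajs/my_traj.txt
+ ${EDITOR:-vi} trajs/my_traj.txt
+ ```
+
+ Then pass `--point_path trajs/my_traj.txt` together with your own `--input-file` to the command above.
+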
+ ### Image to Video
+
+ ```bash
+ cd sat
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU sample_video.py --base configs/tora/model/cogvideox_5b_tora_i2v.yaml configs/tora/inference_sparse.yaml --load ckpts/tora/i2v --output-dir samples --point_path trajs/sawtooth.txt --input-file assets/text/i2v/examples.txt --img_dir assets/images --image2video
+ ```
+
+ The first-frame images should be placed in the directory given by `--img_dir`. The name of each image should be referenced in the corresponding text prompt in `--input-file`, separated by `@@`.
+
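+ As a rough illustration, an `--img_dir` plus `--input-file` pair could look like the sketch below. The `image@@prompt` ordering and the file names here are assumptions for illustration only; follow the bundled `assets/text/i2v/examples.txt` and `assets/images/` for the exact convention.
+
+ ```bash
+ # Hypothetical first-frame image placed in the directory passed via --img_dir.
+ ls assets/images
+ # girl_on_beach.png ...
+
+ # One line per sample in the --input-file, pairing the image name with its prompt via "@@".
+ cat assets/text/i2v/my_examples.txt
+ # girl_on_beach.png@@A girl in a red dress walks along the beach at sunset, waves rolling in behind her.
+ ```
+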
+ ### Recommendations for Text Prompts
+
+ For text prompts, we highly recommend using GPT-4 to enhance the details. Simple prompts may negatively impact both visual quality and motion control effectiveness.
+
+ You can refer to the following resources for guidance:
+
+ - [CogVideoX Documentation](https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py)
+ - [OpenSora Scripts](https://github.com/hpcaitech/Open-Sora/blob/main/scripts/inference.py)
+
+ ## 🖥️ Gradio Demo
+
+ Usage:
+
+ ```bash
+ cd sat
+ python app.py --load ckpts/tora/t2v
+ ```
+
+ ## 🧠 Training
+
+ ### Data Preparation
+
+ Following [this guide](https://github.com/THUDM/CogVideo/blob/main/sat/README.md#preparing-the-dataset), structure the datasets as follows:
+
+ ```
+ .
+ ├── labels
+ │   ├── 1.txt
+ │   ├── 2.txt
+ │   ├── ...
+ └── videos
+     ├── 1.mp4
+     ├── 2.mp4
+     ├── ...
+ ```
+
+ Training data examples are in `sat/training_examples`.
+
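+ For a quick smoke test you can assemble a tiny dataset by hand. The sketch below assumes each `labels/1.txt` holds the plain-text caption for the matching `videos/1.mp4`, as described in the CogVideo SAT guide linked above; the dataset path and caption are made up for illustration, and `sat/training_examples` remains the authoritative reference.
+
+ ```bash
+ # Hypothetical one-sample dataset following the structure above.
+ mkdir -p my_dataset/labels my_dataset/videos
+ cp /path/to/clip.mp4 my_dataset/videos/1.mp4
+ echo "A red sports car drives along a coastal road at sunset." > my_dataset/labels/1.txt
+ ```
+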
+ ### Text to Video
+
+ It requires around 60 GiB of GPU memory (tested on an NVIDIA A100).
+
+ Replace `$N_GPU` with the number of GPUs you want to use.
+
+ - Stage 1
+
+ ```bash
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU train_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/train_dense.yaml --experiment-name "t2v-stage1"
+ ```
+
+ - Stage 2
+
+ ```bash
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU train_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/train_sparse.yaml --experiment-name "t2v-stage2"
+ ```
+
+ ## 🎯 Troubleshooting
+
+ ### 1. ValueError: Non-consecutive added token...
+
+ Upgrade the transformers package to 4.44.2. See [this](https://github.com/THUDM/CogVideo/issues/213) issue.
+
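+ For example, with pip:
+
+ ```bash
+ # Pin transformers to the release that avoids the non-consecutive added token error.
+ pip install transformers==4.44.2
+ ```
+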
 ## 🤝 Acknowledgements

 We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:
 
 ## 📚 Citation

 ```bibtex
+ @article{zhang2025tora2,
+ title={Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation},
 author={Zhenghao Zhang and Junchao Liao and Menghao Li and Zuozhuo Dai and Bingxue Qiu and Siyu Zhu and Long Qin and Weizhi Wang},
+ journal={ACM Multimedia (MM)},
+ year={2025}
 }
 ```