---
base_model:
- THUDM/CogVideoX-5b
language:
- en
library_name: diffusers
license: other
pipeline_tag: text-to-video
tags:
- video
- video-generation
- cogvideox
- alibaba
---

<div align="center">

<img src="icon.jpg" width="250"/>

<h2><center>[🔥ACM MM'25] Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation</center></h2>

Zhenghao Zhang\*, Junchao Liao\*, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang

\* equal contribution
<br>

<a href='https://huggingface.co/papers/2507.05963'><img src='https://img.shields.io/badge/Paper-Tora2-red'></a>
<a href='https://ali-videoai.github.io/Tora2_page/'><img src='https://img.shields.io/badge/Project-Page-Blue'></a>
<a href="https://github.com/alibaba/Tora"><img src='https://img.shields.io/badge/Github-Link-orange'></a>
<a href='https://www.modelscope.cn/studios/xiaoche/Tora'><img src='https://img.shields.io/badge/🤖_ModelScope-ZH_demo-%23654dfc'></a>
<a href='https://www.modelscope.cn/studios/Alibaba_Research_Intelligence_Computing/Tora_En'><img src='https://img.shields.io/badge/🤖_ModelScope-EN_demo-%23654dfc'></a>
<br>

<a href='https://modelscope.cn/models/xiaoche/Tora'><img src='https://img.shields.io/badge/🤖_ModelScope-T2V/I2V_weights(SAT)-%23654dfc'></a>
<a href='https://modelscope.cn/models/Alibaba_Research_Intelligence_Computing/Tora_T2V_diffusers'><img src='https://img.shields.io/badge/🤖_ModelScope-T2V_weights(diffusers)-%23654dfc'></a>
<br>

<a href='https://huggingface.co/Alibaba-Research-Intelligence-Computing/Tora'><img src='https://img.shields.io/badge/🤗_HuggingFace-T2V/I2V_weights(SAT)-%23ff9e0e'></a>
<a href='https://huggingface.co/Alibaba-Research-Intelligence-Computing/Tora_T2V_diffusers'><img src='https://img.shields.io/badge/🤗_HuggingFace-T2V_weights(diffusers)-%23ff9e0e'></a>
</div>

## Please visit our [Github repo](https://github.com/alibaba/Tora) for more details.

## 💡 Abstract

Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora that introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details compared to previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This innovation significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through explicit mapping between motion and personalization embeddings. Tora2 is, to the best of our knowledge, the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation. Experimental results demonstrate that Tora2 achieves competitive performance with state-of-the-art customization methods while providing advanced motion control capabilities, which marks a critical advancement in multi-condition video generation. Project page: https://ali-videoai.github.io/Tora2_page/
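
For intuition only, here is a minimal PyTorch-style sketch of the gated self-attention fusion described above, assuming per-entity trajectory, text, and visual embeddings that share one hidden size. The module name, dimensions, and zero-initialized gate are illustrative assumptions, not the released Tora2 implementation.

```python
# Hypothetical sketch: per-entity trajectory, text, and visual embeddings are
# concatenated into one token sequence, passed through self-attention, and
# blended back in through a learnable gate (tanh(0) = 0, so training starts
# from the unmodified features). Not the actual Tora2 code.
import torch
import torch.nn as nn


class GatedEntityFusion(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-initialized gate

    def forward(self, traj_emb, text_emb, visual_emb):
        # Each input: (batch, num_entities, dim); concatenate along the token axis.
        tokens = torch.cat([traj_emb, text_emb, visual_emb], dim=1)
        normed = self.norm(tokens)
        attended, _ = self.attn(normed, normed, normed)
        return tokens + torch.tanh(self.gate) * attended


fused = GatedEntityFusion()(
    torch.randn(2, 3, 1024), torch.randn(2, 3, 1024), torch.randn(2, 3, 1024)
)
print(fused.shape)  # torch.Size([2, 9, 1024])
```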

## 📣 Updates

- `2025/07/08` 🔥🔥 Our latest work, [Tora2](https://ali-videoai.github.io/Tora2_page/), has been accepted by ACM MM25. Tora2 builds on Tora with design improvements, enabling enhanced appearance and motion customization for multiple entities.
- `2025/05/24` We open-sourced a LoRA-finetuned model of [Wan](https://github.com/Wan-Video/Wan2.1). It turns things in the image into fluffy toys. Check this out: https://github.com/alibaba/wan-toy-transform
- `2025/01/06` 🔥🔥We released Tora Image-to-Video, including inference code and model weights.
- `2024/12/13` SageAttention2 and model compilation are supported in the diffusers version. Tested on an A10, these optimizations speed up every inference step except the first by approximately 52%.
- `2024/12/09` 🔥🔥Diffusers version of Tora and the corresponding model weights are released. Inference VRAM requirements are reduced to around 5 GiB. Please refer to [this](diffusers-version/README.md) for details.
- `2024/11/25` 🔥Text-to-Video training code released.
- `2024/10/31` Model weights uploaded to [HuggingFace](https://huggingface.co/Le0jc/Tora). We also provided an English demo on [ModelScope](https://www.modelscope.cn/studios/Alibaba_Research_Intelligence_Computing/Tora_En).
- `2024/10/23` 🔥🔥Our [ModelScope Demo](https://www.modelscope.cn/studios/xiaoche/Tora) is launched. Welcome to try it out! We also upload the model weights to [ModelScope](https://www.modelscope.cn/models/xiaoche/Tora).
- `2024/10/21` Thanks to [@kijai](https://github.com/kijai) for supporting Tora in ComfyUI! [Link](https://github.com/kijai/ComfyUI-CogVideoXWrapper)
- `2024/10/15` 🔥🔥We released our inference code and model weights. **Please note that this is a CogVideoX version of Tora, built on the CogVideoX-5B model. This version of Tora is meant for academic research purposes only. Due to our commercial plans, we will not be open-sourcing the complete version of Tora at this time.**
- `2024/08/27` We released our v2 paper including appendix.
- `2024/07/31` We submitted our paper on arXiv and released our project page.

## 📑 Table of Contents

- [🎞️ Showcases](#%EF%B8%8F-showcases)
- [✅ TODO List](#-todo-list)
- [🧨 Diffusers version](#-diffusers-verision)
- [🐍 Installation](#-installation)
- [📦 Model Weights](#-model-weights)
- [🔄 Inference](#-inference)
- [🖥️ Gradio Demo](#%EF%B8%8F-gradio-demo)
- [🧠 Training](#-training)
- [🎯 Troubleshooting](#-troubleshooting)
- [🤝 Acknowledgements](#-acknowledgements)
- [📄 Our previous work](#-our-previous-work)
- [📚 Citation](#-citation)

## 🎞️ Showcases

https://github.com/user-attachments/assets/949d5e99-18c9-49d6-b669-9003ccd44bf1

https://github.com/user-attachments/assets/7e7dbe87-a8ba-4710-afd0-9ef528ec329b

https://github.com/user-attachments/assets/4026c23d-229d-45d7-b5be-6f3eb9e4fd50

All videos are available at this [link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/showcases.zip).

## ✅ TODO List

- [x] Release our inference code and model weights
- [x] Provide a ModelScope Demo
- [x] Release our training code
- [x] Release diffusers version and optimize the GPU memory usage
- [x] Release complete version of Tora

## 📦 Model Weights

### Folder Structure

```
Tora
└── sat
    └── ckpts
        ├── t5-v1_1-xxl
        │   ├── model-00001-of-00002.safetensors
        │   └── ...
        ├── vae
        │   └── 3d-vae.pt
        ├── tora
        │   ├── i2v
        │   │   └── mp_rank_00_model_states.pt
        │   └── t2v
        │       └── mp_rank_00_model_states.pt
        └── CogVideoX-5b-sat # for training stage 1
            └── mp_rank_00_model_states.pt
```

### Download Links

*Note: Downloading the `tora` weights requires following the [CogVideoX License](CogVideoX_LICENSE).* You can download them from HuggingFace, ModelScope, or the native links below.\
After downloading, place the model weights in the `Tora/sat/ckpts` folder.

#### HuggingFace

```bash
# This can be faster
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Alibaba-Research-Intelligence-Computing/Tora --local-dir ckpts
```

or

```bash
# use git
git lfs install
git clone https://huggingface.co/Alibaba-Research-Intelligence-Computing/Tora
```

#### ModelScope

- SDK

```python
from modelscope import snapshot_download

model_dir = snapshot_download('xiaoche/Tora')
```

- Git

```bash
git clone https://www.modelscope.cn/xiaoche/Tora.git
```

#### Native

- Download the VAE and T5 models following [CogVideo](https://github.com/THUDM/CogVideo/blob/main/sat/README.md#2-download-model-weights):
    - VAE: https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
    - T5: [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder), [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer)
- Tora t2v model weights: [Link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/mp_rank_00_model_states.pt). Downloading this weight requires following the [CogVideoX License](CogVideoX_LICENSE).

## 🔄 Inference

### Text to Video

It requires around 30 GiB of GPU memory (tested on an NVIDIA A100).

```bash
cd sat
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU sample_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/inference_sparse.yaml --load ckpts/tora/t2v --output-dir samples --point_path trajs/coaster.txt --input-file assets/text/t2v/examples.txt
```

You can point `--input-file` and `--point_path` to your own prompt and trajectory files. Please note that the trajectory is drawn on a 256x256 canvas.

Replace `$N_GPU` with the number of GPUs you want to use.
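
The trajectory file format is defined by the sample files in `trajs/`. As a rough illustration only, the sketch below writes a circular trajectory scaled to the 256x256 canvas, assuming one `x,y` pair per line; verify the exact layout and the expected number of points against the provided samples before using it.

```python
# Hypothetical helper for writing a trajectory file. The "one 'x,y' pair per
# line" layout and the point count are assumptions; compare with the provided
# trajs/*.txt samples and adjust if they differ.
import math

CANVAS = 256  # trajectories are drawn on a 256x256 canvas


def circular_trajectory(num_points: int = 49, radius: float = 80.0):
    cx = cy = CANVAS / 2
    for i in range(num_points):
        angle = 2 * math.pi * i / num_points
        x = cx + radius * math.cos(angle)
        y = cy + radius * math.sin(angle)
        # Clamp to the canvas just in case.
        yield min(max(x, 0), CANVAS - 1), min(max(y, 0), CANVAS - 1)


with open("trajs/circle.txt", "w") as f:
    for x, y in circular_trajectory():
        f.write(f"{x:.1f},{y:.1f}\n")
```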

### Image to Video

```bash
cd sat
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU sample_video.py --base configs/tora/model/cogvideox_5b_tora_i2v.yaml configs/tora/inference_sparse.yaml --load ckpts/tora/i2v --output-dir samples --point_path trajs/sawtooth.txt --input-file assets/text/i2v/examples.txt --img_dir assets/images --image2video
```

The first-frame images should be placed in `--img_dir`. The name of each image should be specified in the corresponding text prompt in `--input-file`, separated by `@@`.
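
As an illustration of the `@@` convention, the hypothetical snippet below writes an input file pairing image names with prompts. Whether the image name precedes or follows the prompt is an assumption, so mirror the layout of `assets/text/i2v/examples.txt`.

```python
# Hypothetical construction of an i2v --input-file. The ordering around "@@"
# is an assumption; follow assets/text/i2v/examples.txt for the exact layout.
entries = [
    ("cat.jpg", "A fluffy cat slowly walks across a sunlit wooden floor."),
    ("boat.jpg", "A small sailboat drifts along a calm turquoise bay at dawn."),
]

with open("assets/text/i2v/my_examples.txt", "w") as f:
    for image_name, prompt in entries:
        f.write(f"{image_name}@@{prompt}\n")
```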

### Recommendations for Text Prompts

For text prompts, we highly recommend using GPT-4 to enhance the details. Simple prompts may negatively impact both visual quality and motion control effectiveness.

You can refer to the following resources for guidance:

- [CogVideoX Documentation](https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py)
- [OpenSora Scripts](https://github.com/hpcaitech/Open-Sora/blob/main/scripts/inference.py)
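
If you want to script the enhancement step, a minimal sketch using the OpenAI Python SDK is shown below; the model name, system prompt, and output handling are assumptions rather than the exact recipe used in CogVideoX's `convert_demo.py`.

```python
# Hypothetical prompt-enhancement helper. Assumes OPENAI_API_KEY is set and
# openai>=1.0 is installed; the system prompt here is only an example.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Rewrite the user's short video prompt into a single, richly detailed "
    "paragraph describing the subjects, their motion, the scene, lighting, and camera."
)


def enhance_prompt(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content


print(enhance_prompt("A red roller coaster car rides along the track."))
```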

## 🖥️ Gradio Demo

Usage:

```bash
cd sat
python app.py --load ckpts/tora/t2v
```

## 🧠 Training

### Data Preparation

Following [this guide](https://github.com/THUDM/CogVideo/blob/main/sat/README.md#preparing-the-dataset), structure the datasets as follows:

```
.
├── labels
│   ├── 1.txt
│   ├── 2.txt
│   ├── ...
└── videos
    ├── 1.mp4
    ├── 2.mp4
    ├── ...
```

Training data examples are provided in `sat/training_examples`.
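
Before launching training, it can help to confirm that every video has a matching caption. The small check below is not part of the repo; it simply assumes the `labels/` and `videos/` layout shown above.

```python
# Hypothetical sanity check: every videos/*.mp4 should have a labels/*.txt
# with the same stem, following the dataset layout shown above.
from pathlib import Path

root = Path(".")  # dataset root containing labels/ and videos/
videos = {p.stem for p in (root / "videos").glob("*.mp4")}
labels = {p.stem for p in (root / "labels").glob("*.txt")}

missing_labels = sorted(videos - labels)
missing_videos = sorted(labels - videos)
print(f"{len(videos & labels)} matched pairs")
if missing_labels:
    print("videos without captions:", missing_labels)
if missing_videos:
    print("captions without videos:", missing_videos)
```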

### Text to Video

It requires around 60 GiB of GPU memory (tested on an NVIDIA A100).

Replace `$N_GPU` with the number of GPUs you want to use.

- Stage 1

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU train_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/train_dense.yaml --experiment-name "t2v-stage1"
```

- Stage 2

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU train_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/train_sparse.yaml --experiment-name "t2v-stage2"
```

## 🎯 Troubleshooting

### 1. ValueError: Non-consecutive added token...

Upgrade the `transformers` package to 4.44.2. See [this issue](https://github.com/THUDM/CogVideo/issues/213) for details.

## 🤝 Acknowledgements

We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:

- [CogVideo](https://github.com/THUDM/CogVideo): An open source video generation framework by THUKEG.
- [Open-Sora](https://github.com/hpcaitech/Open-Sora): An open source video generation framework by HPC-AI Tech.
- [MotionCtrl](https://github.com/TencentARC/MotionCtrl): A video generation model supporting motion control by ARC Lab, Tencent PCG.
- [ComfyUI-DragNUWA](https://github.com/chaojie/ComfyUI-DragNUWA): An implementation of DragNUWA for ComfyUI.

Special thanks to the contributors of these libraries for their hard work and dedication!

## 📄 Our previous work

- [AnimateAnything: Fine Grained Open Domain Image Animation with Motion Guidance](https://github.com/alibaba/animate-anything)

## 📚 Citation

```bibtex
@inproceedings{zhang2025tora2,
      title={Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation},
      author={Zhenghao Zhang and Junchao Liao and Menghao Li and Zuozhuo Dai and Bingxue Qiu and Siyu Zhu and Long Qin and Weizhi Wang},
      booktitle={ACM Multimedia (MM)},
      year={2025}
}
```