---
license: apache-2.0
---
<meta name="google-site-verification" content="-XQC-POJtlDPD3i2KSOxbFkSBde_Uq9obAIh_4mxTkM" />

<div align="center">

<h2><a href="https://arxiv.org/abs/2505.10238">MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation</a></h2>

> Official project page of **MTVCrafter**, a novel framework for general and high-quality human image animation using raw 3D motion sequences.

<!--
[Yanbo Ding](https://github.com/DINGYANB),
[Shaobin Zhuang](https://scholar.google.com/citations?user=PGaDirMAAAAJ&hl=zh-CN&oi=ao), 
[Kunchang Li](https://scholar.google.com/citations?user=D4tLSbsAAAAJ), 
[Zhengrong Yue](https://arxiv.org/search/?searchtype=author&query=Zhengrong%20Yue), 
[Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl), 
[Yali Wang†](https://scholar.google.com/citations?user=hD948dkAAAAJ)
-->

[![arXiv](https://img.shields.io/badge/arXiv-2505.10238-b31b1b.svg)](https://www.arxiv.org/abs/2505.10238)
[![GitHub](https://img.shields.io/badge/GitHub-MTVCrafter-blue?logo=github)](https://github.com/DINGYANB/MTVCrafter)
[![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow)](https://huggingface.co/yanboding/)
[![Project Page](https://img.shields.io/badge/🌐%20Page-GitHub.io-brightgreen)](https://dingyanb.github.io/MTVCtafter/)

</div>


## 🔍 Abstract

Human image animation has attracted increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information.  
To tackle these problems, we propose **MTVCrafter (Motion Tokenization Video Crafter)**, the first framework that directly models raw 3D motion sequences for open-world human image animation beyond intermediate 2D representations.

- We introduce **4DMoT (4D motion tokenizer)** to encode raw motion data into discrete motion tokens, preserving compact yet expressive 4D spatio-temporal information.
- Then, we propose **MV-DiT (Motion-aware Video DiT)**, which integrates a motion attention module and 4D positional encodings to effectively modulate vision tokens with motion tokens.
- The overall pipeline facilitates high-quality human video generation guided by 4D motion tokens.

MTVCrafter achieves **state-of-the-art results with an FID-VID of 6.98**, outperforming the second-best by approximately **65%**. It generalizes well to diverse characters (single/multiple, full/half-body) across various styles.

## 🎯 Motivation

![Motivation](./static/images/Motivation.png)

Our motivation is that directly tokenizing 4D motion captures more faithful and expressive information than traditional 2D-rendered pose images derived from the driving video.

## 💡 Method

![Method](./static/images/4DMoT.png)

*(1) 4DMoT*:
Our 4D motion tokenizer consists of an encoder-decoder framework that learns spatio-temporal latent representations of SMPL motion sequences,
and a vector quantizer that learns discrete tokens in a unified space.
All operations are performed in 2D space along the frame and joint axes.
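
For intuition, here is a heavily simplified PyTorch sketch of this design; it is a toy stand-in, not the released 4DMoT. The layer widths, codebook size, and loss weights are illustrative placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionVQ(nn.Module):
    """Toy 4DMoT-style tokenizer: 2D convolutions act along the
    (frame, joint) axes of an SMPL joint sequence, and a codebook
    quantizes the latents into discrete motion tokens."""
    def __init__(self, num_codes=8192, code_dim=64, in_dim=3):
        super().__init__()
        # Encoder/decoder operate on (B, C, T, J); channels = xyz coords.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_dim, 128, 3, padding=1), nn.SiLU(),
            nn.Conv2d(128, code_dim, 3, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(code_dim, 128, 3, padding=1), nn.SiLU(),
            nn.Conv2d(128, in_dim, 3, padding=1),
        )
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, motion):                        # motion: (B, T, J, 3)
        z = self.encoder(motion.permute(0, 3, 1, 2))  # (B, D, T, J)
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])
        # Nearest-neighbour lookup in the codebook.
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        zq = self.codebook(idx).view(z.shape[0], z.shape[2], z.shape[3], -1)
        zq = zq.permute(0, 3, 1, 2)                   # (B, D, T, J)
        # Standard VQ-VAE losses (weights are placeholders).
        codebook_loss = F.mse_loss(zq, z.detach())    # move codes toward latents
        commit_loss = F.mse_loss(z, zq.detach())      # keep encoder near codes
        # Straight-through estimator so gradients reach the encoder.
        zq_st = z + (zq - z).detach()
        recon = self.decoder(zq_st).permute(0, 2, 3, 1)  # back to (B, T, J, 3)
        return recon, idx.view(z.shape[0], -1), codebook_loss + 0.25 * commit_loss
```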

![Method](./static/images/MV-DiT.png)

*(2) MV-DiT*:
Building on a video DiT architecture,
we design a 4D motion attention module that combines motion tokens with vision tokens.
Since tokenization and flattening disrupt positional information,
we introduce 4D RoPE to recover the spatio-temporal relationships.
To further improve generation quality and generalization,
we use learnable unconditional tokens for motion classifier-free guidance.
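
For readers who want the mechanics, below is a hedged sketch of the motion attention plus unconditional-token idea (4D RoPE is omitted for brevity). The module name, sizes, and shapes are illustrative assumptions, not the released MV-DiT code:

```python
import torch
import torch.nn as nn

class MotionCrossAttention(nn.Module):
    """Sketch: vision tokens attend to embedded motion tokens; a learnable
    'unconditional' motion sequence stands in when motion is dropped,
    enabling motion classifier-free guidance (CFG)."""
    def __init__(self, dim=1024, num_heads=16, num_codes=8192, max_motion_len=1024):
        super().__init__()
        self.motion_embed = nn.Embedding(num_codes, dim)   # lift code ids to features
        self.uncond_tokens = nn.Parameter(torch.zeros(1, max_motion_len, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_tokens, motion_ids, drop_motion=False):
        # vision_tokens: (B, N, dim); motion_ids: (B, M) codebook indices, M <= max_motion_len
        B, M = motion_ids.shape
        if drop_motion:   # unconditional branch: randomly during training, and for CFG at sampling
            m = self.uncond_tokens[:, :M].expand(B, -1, -1)
        else:
            m = self.motion_embed(motion_ids)
        out, _ = self.attn(query=vision_tokens, key=m, value=m)
        return vision_tokens + out   # residual injection into the DiT block

# At sampling time, CFG combines the two branches with a guidance scale s:
#   eps = eps_uncond + s * (eps_cond - eps_uncond)
```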

---

## 🛠️ Installation

We recommend using a clean Python environment (Python 3.10+).

```bash
# Clone the repository
git clone https://github.com/DINGYANB/MTVCrafter.git && cd MTVCrafter

# Create virtual environment
conda create -n mtvcrafter python=3.11
conda activate mtvcrafter

# Install dependencies
pip install -r requirements.txt
```

## 🚀 Usage

To animate a human image with a given 3D motion sequence,
you first need to extract the SMPL motion sequence from the driving video:

```bash
python process_nlf.py "your_video_directory"
```

Then, you can use the following command to animate the image guided by 4D motion tokens:

```bash
python infer.py --ref_image_path "ref_images/human.png" --motion_data_path "data/sample_data.pkl" --output_path "inference_output"
```

- `--ref_image_path`: Path to the reference character image.
- `--motion_data_path`: Path to the motion sequence (`.pkl` format).
- `--output_path`: Directory where the generated animation results are saved.
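
If inference fails, it can help to sanity-check the motion file first. The snippet below only prints whatever keys and array shapes it finds, since the exact layout saved by `process_nlf.py` may differ from what you expect:

```python
import pickle

# Inspect a motion .pkl before running inference.
with open("data/sample_data.pkl", "rb") as f:
    data = pickle.load(f)

print(type(data))
if isinstance(data, dict):
    for key, value in data.items():
        shape = getattr(value, "shape", None)
        print(key, shape if shape is not None else type(value))
```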

To train our 4DMoT tokenizer on your own dataset, run:

```bash
accelerate launch train_vqvae.py
```
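
Under the hood, `accelerate launch` expects the usual Hugging Face Accelerate training-loop pattern. The sketch below shows that pattern with dummy data and the toy `MotionVQ` from the method section above; it illustrates the setup, not the contents of `train_vqvae.py`:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Dummy data: 32 clips of 16 frames x 24 SMPL joints x 3 coordinates.
dummy = TensorDataset(torch.randn(32, 16, 24, 3))
loader = DataLoader(dummy, batch_size=4)

accelerator = Accelerator()
model = MotionVQ()                      # the toy tokenizer sketched earlier
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for (motion,) in loader:
    recon, _, vq_loss = model(motion)
    loss = F.mse_loss(recon, motion) + vq_loss
    accelerator.backward(loss)          # handles gradient sync across GPUs
    optimizer.step()
    optimizer.zero_grad()
```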

## 📄 Citation

If you find our work useful, please consider citing:

```bibtex
@article{ding2025mtvcrafter,
  title={MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation},
  author={Ding, Yanbo},
  journal={arXiv preprint arXiv:2505.10238},
  year={2025}
}
```

## 📬 Contact

For questions or collaboration, feel free to reach out via GitHub Issues
or email me at 📧 [email protected].