---
license: apache-2.0
---
<meta name="google-site-verification" content="-XQC-POJtlDPD3i2KSOxbFkSBde_Uq9obAIh_4mxTkM" />
<div align="center">
<h2><a href="https://www.arxiv.org/abs/2505.10238">MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation</a></h2>
> Official project page of **MTVCrafter**, a novel framework for general and high-quality human image animation using raw 3D motion sequences.
<!--
[Yanbo Ding](https://github.com/DINGYANB),
[Shaobin Zhuang](https://scholar.google.com/citations?user=PGaDirMAAAAJ&hl=zh-CN&oi=ao),
[Kunchang Li](https://scholar.google.com/citations?user=D4tLSbsAAAAJ),
[Zhengrong Yue](https://arxiv.org/search/?searchtype=author&query=Zhengrong%20Yue),
[Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl),
[Yali Wang†](https://scholar.google.com/citations?user=hD948dkAAAAJ)
-->
🌐 [Project Page](https://dingyanb.github.io/MTVCtafter/) |
📄 [ArXiv](https://arxiv.org/abs/2505.10238) |
💻 [Code](https://github.com/DINGYANB/MTVCrafter) |
🤗 [Hugging Face Model](https://huggingface.co/yanboding/MTVCrafter)
</div>
## 📖 Abstract
Human image animation has attracted increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information.
To tackle these problems, we propose **MTVCrafter (Motion Tokenization Video Crafter)**, the first framework that directly models raw 3D motion sequences for open-world human image animation beyond intermediate 2D representations.
- We introduce **4DMoT (4D motion tokenizer)** to encode raw motion data into discrete motion tokens, preserving compact yet expressive 4D spatio-temporal information.
- Then, we propose **MV-DiT (Motion-aware Video DiT)**, which integrates a motion attention module and 4D positional encodings to effectively modulate vision tokens with motion tokens.
- The overall pipeline facilitates high-quality human video generation guided by 4D motion tokens.
MTVCrafter achieves **state-of-the-art results with an FID-VID of 6.98**, outperforming the second-best by approximately **65%**. It generalizes well to diverse characters (single/multiple, full/half-body) across various styles.
## 🎯 Motivation

Our motivation is that directly tokenizing 4D motion captures more faithful and expressive information than traditional 2D-rendered pose images derived from the driving video.
## 💡 Method

*(1) 4DMoT*:
Our 4D motion tokenizer consists of an encoder-decoder framework that learns spatio-temporal latent representations of SMPL motion sequences,
and a vector quantizer that learns discrete tokens in a unified space.
All operations are performed in 2D space along the frame and joint axes.
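To make the data flow concrete, here is a minimal sketch of such a tokenizer. It is not the released 4DMoT: the input shape (frames × joints × xyz), channel widths, and codebook size are illustrative assumptions, and only the encoder / vector-quantizer / decoder structure described above is kept.
```python
# Minimal sketch of a VQ-style motion tokenizer, NOT the official 4DMoT.
# Assumptions: input motion is (batch, T frames, J joints, 3 coords); codebook size,
# channel widths, and kernel sizes are illustrative only.
import torch
import torch.nn as nn

class MotionVQTokenizer(nn.Module):
    def __init__(self, in_dim=3, hidden=128, codebook_size=1024, code_dim=64):
        super().__init__()
        # Encoder: 2D convs over the (frame, joint) plane, treating xyz as channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_dim, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, code_dim, kernel_size=3, padding=1),
        )
        # Codebook for vector quantization.
        self.codebook = nn.Embedding(codebook_size, code_dim)
        # Decoder mirrors the encoder and reconstructs joint positions.
        self.decoder = nn.Sequential(
            nn.Conv2d(code_dim, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, in_dim, kernel_size=3, padding=1),
        )

    def quantize(self, z):
        # z: (B, code_dim, T, J) -> nearest codebook entry per (frame, joint) cell.
        b, c, t, j = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)       # (B*T*J, code_dim)
        dist = torch.cdist(flat, self.codebook.weight)    # distances to all codes
        idx = dist.argmin(dim=1)                          # discrete motion tokens
        zq = self.codebook(idx).view(b, t, j, c).permute(0, 3, 1, 2)
        # Straight-through estimator so gradients flow back to the encoder.
        zq = z + (zq - z).detach()
        return zq, idx.view(b, t, j)

    def forward(self, motion):
        # motion: (B, T, J, 3) SMPL joint positions -> channels-first for Conv2d.
        x = motion.permute(0, 3, 1, 2)
        z = self.encoder(x)
        zq, tokens = self.quantize(z)
        recon = self.decoder(zq).permute(0, 2, 3, 1)
        return recon, tokens

motion = torch.randn(2, 16, 24, 3)        # 2 clips, 16 frames, 24 SMPL joints
recon, tokens = MotionVQTokenizer()(motion)
print(recon.shape, tokens.shape)          # (2, 16, 24, 3) and (2, 16, 24)
```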

*(2) MV-DiT*:
Based on the video DiT architecture,
we design a 4D motion attention module to combine motion tokens with vision tokens.
Since tokenization and flattening disrupt positional information,
we introduce 4D RoPE to recover the spatio-temporal relationships.
To further improve generation quality and generalization,
we use learnable unconditional tokens for motion classifier-free guidance.
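The sketch below illustrates this idea in isolation, assuming a generic cross-attention block: vision tokens attend to motion tokens, and a bank of learnable unconditional motion tokens is swapped in when conditioning is dropped for classifier-free guidance. It is not the official MV-DiT; all dimensions are placeholders, and the 4D RoPE described above is omitted for brevity.
```python
# Minimal sketch of a motion-attention block with learnable unconditional tokens,
# NOT the official MV-DiT. Dimensions are placeholders and 4D RoPE is omitted.
import torch
import torch.nn as nn

class MotionAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_motion_tokens=256):
        super().__init__()
        # Cross-attention: vision tokens (queries) attend to motion tokens (keys/values).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # Learnable unconditional motion tokens, substituted for the real motion tokens
        # when conditioning is dropped (enables motion classifier-free guidance).
        self.uncond_motion = nn.Parameter(torch.randn(1, num_motion_tokens, dim) * 0.02)

    def forward(self, vision_tokens, motion_tokens, drop_motion=False):
        # vision_tokens: (B, N_vis, dim); motion_tokens: (B, N_mot, dim)
        if drop_motion:
            motion_tokens = self.uncond_motion.expand(vision_tokens.size(0), -1, -1)
        q = self.norm_q(vision_tokens)
        kv = self.norm_kv(motion_tokens)
        out, _ = self.attn(q, kv, kv)
        return vision_tokens + out  # residual update of the vision tokens

block = MotionAttention()
vis, mot = torch.randn(2, 1024, 512), torch.randn(2, 256, 512)
cond = block(vis, mot)                      # motion-conditioned pass
uncond = block(vis, mot, drop_motion=True)  # unconditional pass
# At sampling time, classifier-free guidance blends the two full-model predictions;
# shown here on a single block's output only to illustrate the formula.
guided = uncond + 3.0 * (cond - uncond)
print(guided.shape)  # torch.Size([2, 1024, 512])
```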
---
## 🛠️ Installation
We recommend using a clean Python environment (Python 3.10+).
```bash
# Clone this repository
git clone https://github.com/DINGYANB/MTVCrafter.git
cd MTVCrafter
# Create virtual environment
conda create -n mtvcrafter python=3.11
conda activate mtvcrafter
# Install dependencies
pip install -r requirements.txt
```
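Assuming PyTorch is among the pinned dependencies in `requirements.txt` (inference with a video DiT needs a CUDA-capable GPU in practice), a quick sanity check of the environment might look like:
```python
# Quick environment sanity check (assumes PyTorch is installed via requirements.txt).
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```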
## 🚀 Usage
To animate a human image with a given 3D motion sequence,
you first need to extract the SMPL motion sequence from the driving video:
```bash
python process_nlf.py "your_video_directory"
```
Then, you can use the following command to animate the image guided by 4D motion tokens:
```bash
python infer.py --ref_image_path "ref_images/hunam.png" --motion_data_path "data/sample_data.pkl" --output_path "inference_output"
```
- `--ref_image_path`: Path to the reference character image.
- `--motion_data_path`: Path to the motion sequence (.pkl format).
- `--output_path`: Where to save the generated animation results.
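If you want to animate the same reference image with several motion clips, a small driver script can loop over the `.pkl` files and call `infer.py` with the flags listed above. The script below is only a hypothetical convenience wrapper; the directory names are placeholders.
```python
# Hypothetical batch driver: runs infer.py once per motion .pkl, using only the
# flags documented above. Directory names are placeholders.
import subprocess
from pathlib import Path

ref_image = "ref_images/hunam.png"      # reference character (path from the example above)
motion_dir = Path("data")               # folder of SMPL motion .pkl files
output_root = Path("inference_output")

for motion_pkl in sorted(motion_dir.glob("*.pkl")):
    out_dir = output_root / motion_pkl.stem
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "python", "infer.py",
            "--ref_image_path", ref_image,
            "--motion_data_path", str(motion_pkl),
            "--output_path", str(out_dir),
        ],
        check=True,
    )
```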
To train our 4DMoT on your own dataset, run:
```bash
accelerate launch train_vqvae.py
```
## 📝 Citation
If you find our work useful, please consider citing:
```bibtex
@misc{ding2025mtvcrafter4dmotiontokenization,
      title={MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation},
      author={Yanbo Ding and Xirui Hu and Zhizhi Guo and Yali Wang},
      year={2025},
      eprint={2505.10238},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.10238},
}
```
## 💬 Contact
For questions or collaboration, feel free to reach out via GitHub Issues
or email me at 📧 [email protected].