
# EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation

Terminal Technology Department, Alipay, Ant Group.


## 🚀 EchoMimic Series

  • EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation. GitHub
  • EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation. GitHub
  • EchoMimicV1: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning. GitHub

## 📣 Updates

  • [2025.07.08] 🔥 Our paper is now available on arXiv.

## 🌅 Gallery

For more demo videos, please refer to the project page.

## Quick Start

### Environment Setup

  • Tested System Environment: CentOS 7.2 / Ubuntu 22.04, CUDA >= 12.1
  • Tested GPUs: A100 (80G) / RTX 4090D (24G) / V100 (16G)
  • Tested Python Versions: 3.10 / 3.11

### 🛠️ Installation

1. Create a conda environment and install PyTorch and xformers

conda create -n echomimic_v3 python=3.10
conda activate echomimic_v3
# Example for CUDA 12.1 -- the exact torch/xformers wheels are an assumption; adjust to your toolchain
pip install torch torchvision xformers --extra-index-url https://download.pytorch.org/whl/cu121

2. Other dependencies

pip install -r requirements.txt
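
Once the dependencies are installed, a minimal check like the one below (a sketch, assuming PyTorch was installed in step 1) confirms that the versions and GPU match the tested environment listed above.

```python
# Quick sanity check of the installed environment: PyTorch, CUDA, and the visible GPU.
import torch

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("cuda runtime:", torch.version.cuda)
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))
```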

### 🧱 Model Preparation

| Models | Download Link | Notes |
|:-------|:--------------|:------|
| Wan2.1-Fun-1.3B-InP | 🤗 Huggingface | Base model |
| wav2vec2-base | 🤗 Huggingface | Audio encoder |
| EchoMimicV3 | 🤗 Huggingface | Our weights |

The downloaded weights are organized as follows (a minimal download sketch is given after the tree).

./models/
├── Wan2.1-Fun-1.3B-InP
├── wav2vec2-base-960h
└── transformer
    └── diffusion_pytorch_model.safetensors
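
The table above provides the download links. Below is a minimal sketch of one way to fetch the checkpoints into this layout with `huggingface_hub`; except for `facebook/wav2vec2-base-960h`, the repo IDs are placeholders and should be taken from the links in the table.

```python
# Download sketch: repo IDs other than wav2vec2-base-960h are placeholders (assumptions).
from huggingface_hub import snapshot_download

snapshot_download(repo_id="facebook/wav2vec2-base-960h",
                  local_dir="./models/wav2vec2-base-960h")
snapshot_download(repo_id="<Wan2.1-Fun-1.3B-InP repo id>",   # take from the table above
                  local_dir="./models/Wan2.1-Fun-1.3B-InP")
snapshot_download(repo_id="<EchoMimicV3 repo id>",           # take from the table above
                  local_dir="./models/transformer")          # should hold diffusion_pytorch_model.safetensors
```
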
### 🔑 Quick Inference

python infer.py

> Tips
> - Audio CFG: Audio CFG works best between 2 and 3. Increase the audio CFG value for better lip synchronization; decrease it to improve visual quality.
> - Text CFG: Text CFG works best between 4 and 6. Increase the text CFG value for better prompt following; decrease it to improve visual quality (a generic guidance-combination sketch follows these tips).
> - TeaCache: The optimal range for `--teacache_thresh` is between 0 and 0.1.
> - Sampling steps: 5 steps for a talking head, 15 to 25 steps for a talking body.
> - Long video generation: to generate a video longer than 138 frames, use Long Video CFG.
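
The audio and text CFG tips refer to two independent guidance scales. As a rough, generic illustration of how two such scales are typically combined in multi-condition classifier-free guidance (a sketch; not necessarily EchoMimicV3's exact formulation):

```python
import torch

def dual_cfg(pred_uncond: torch.Tensor,
             pred_audio: torch.Tensor,
             pred_audio_text: torch.Tensor,
             audio_cfg: float = 2.5,
             text_cfg: float = 5.0) -> torch.Tensor:
    """Combine three denoiser predictions with separate audio and text guidance scales.

    pred_uncond     -- prediction with neither audio nor text conditioning
    pred_audio      -- prediction with audio conditioning only
    pred_audio_text -- prediction with both audio and text conditioning
    """
    return (pred_uncond
            + audio_cfg * (pred_audio - pred_uncond)
            + text_cfg * (pred_audio_text - pred_audio))
```

Raising `audio_cfg` weights the audio-conditioned prediction more heavily (tighter lip sync), while raising `text_cfg` weights the fully conditioned prediction (stronger prompt following), mirroring the trade-offs described in the tips above.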


## 📝 TODO List
| Status | Milestone |
|:------:|:----------|
| 2025.08.08 | Inference code of EchoMimicV3 available on GitHub |
| 🚀 | Preview pretrained models (English and Chinese) on HuggingFace |
| 🚀 | Preview pretrained models (English and Chinese) on ModelScope |
| 🚀 | 720P pretrained models (English and Chinese) on HuggingFace |
| 🚀 | 720P pretrained models (English and Chinese) on ModelScope |
| 🚀 | Training code of EchoMimicV3 available on GitHub |



## 📒 Citation

If you find our work useful for your research, please consider citing the paper:

@misc{meng2025echomimicv3,
      title={EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation},
      author={Rang Meng and Yan Wang and Weipeng Wu and Ruobing Zheng and Yuming Li and Chenguang Ma},
      year={2025},
      eprint={2507.03905},
      archivePrefix={arXiv}
}


## 🌟 Star History
[![Star History Chart](https://api.star-history.com/svg?repos=antgroup/echomimic_v3&type=Date)](https://star-history.com/#antgroup/echomimic_v3&Date)