EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation
Terminal Technology Department, Alipay, Ant Group.
## 🚀 EchoMimic Series
- EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation. GitHub
- EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation. GitHub
- EchoMimicV1: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning. GitHub
## 📣 Updates
- [2025.07.08] 🔥 Our paper is now publicly available on arXiv.
## 🌅 Gallery
## Quick Start
### Environment Setup
- Tested System Environment: CentOS 7.2 / Ubuntu 22.04, CUDA >= 12.1
- Tested GPUs: A100 (80G) / RTX 4090D (24G) / V100 (16G)
- Tested Python Version: 3.10 / 3.11
### 🛠️ Installation
1. Create a conda environment and install PyTorch and xformers (an example install command is sketched after these steps):

```bash
conda create -n echomimic_v3 python=3.10
conda activate echomimic_v3
```
2. Install the other dependencies:

```bash
pip install -r requirements.txt
```
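Since the steps above only create the environment, here is a minimal sketch of installing PyTorch and xformers, assuming CUDA 12.1 wheels; the package set and index URL are assumptions and should be adjusted to your CUDA version:

```bash
# Assumption: CUDA 12.1 wheels; pick the index URL that matches your installed CUDA version.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install xformers
```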
### 🧱 Model Preparation
| Models | Download Link | Notes |
|---|---|---|
| Wan2.1-Fun-1.3B-InP | 🤗 Huggingface | Base model |
| wav2vec2-base | 🤗 Huggingface | Audio encoder |
| EchoMimicV3 | 🤗 Huggingface | Our weights |
The weights are organized as follows:
```text
./models/
├── Wan2.1-Fun-1.3B-InP
├── wav2vec2-base-960h
└── transformer
    └── diffusion_pytorch_model.safetensors
```
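As an illustration, the checkpoints can be fetched with `huggingface-cli`; the repository IDs below are assumptions, so replace them with the exact names from the Hugging Face links in the table above:

```bash
# Repository IDs are assumptions; use the ones linked in the Model Preparation table,
# and adjust the local directories so the layout matches the tree above.
huggingface-cli download alibaba-pai/Wan2.1-Fun-1.3B-InP --local-dir ./models/Wan2.1-Fun-1.3B-InP
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./models/wav2vec2-base-960h
huggingface-cli download BadToBest/EchoMimicV3 --local-dir ./models
```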
### 🔑 Quick Inference
```bash
python infer.py
```
> Tips (an example invocation combining these settings is sketched after this list)
> - Audio CFG: Audio CFG works best in the 2~3 range. Increase it for better lip synchronization; decrease it for better visual quality.
> - Text CFG: Text CFG works best in the 4~6 range. Increase it for better prompt following; decrease it for better visual quality.
> - TeaCache: The optimal range for `--teacache_thresh` is 0~0.1.
> - Sampling steps: 5 steps are enough for a talking head; use 15~25 steps for a talking body.
> - Long video generation: To generate a video longer than 138 frames, use Long Video CFG.
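A minimal sketch of a run that combines these settings; only `--teacache_thresh` is documented above, so every other flag name here is an assumption and may differ from the actual `infer.py` interface (check `python infer.py --help` for the real options):

```bash
# All flag names except --teacache_thresh are hypothetical placeholders.
# Audio CFG 2.5 (suggested 2~3), text CFG 5.0 (suggested 4~6),
# TeaCache threshold 0.05 (suggested 0~0.1), 20 steps for a talking-body clip.
python infer.py \
    --audio_guidance_scale 2.5 \
    --guidance_scale 5.0 \
    --teacache_thresh 0.05 \
    --num_inference_steps 20
```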
## 📝 TODO List
| Status | Milestone |
|:--------:|:-------------------------------------------------------------------------|
| 2025.08.08 | Inference code of EchoMimicV3 released on GitHub |
| 🚀 | Preview pretrained models (English and Chinese) on Hugging Face |
| 🚀 | Preview pretrained models (English and Chinese) on ModelScope |
| 🚀 | 720P pretrained models (English and Chinese) on Hugging Face |
| 🚀 | 720P pretrained models (English and Chinese) on ModelScope |
| 🚀 | Training code of EchoMimicV3 released on GitHub |
## 📒 Citation
If you find our work useful for your research, please consider citing the paper:

```bibtex
@misc{meng2025echomimicv3,
      title={EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation},
      author={Rang Meng and Yan Wang and Weipeng Wu and Ruobing Zheng and Yuming Li and Chenguang Ma},
      year={2025},
      eprint={2507.03905},
      archivePrefix={arXiv}
}
```
## 🌟 Star History
[Star History Chart](https://star-history.com/#antgroup/echomimic_v3&Date)