fantasy-talking-demo / README.md

jintinghou

Fix YAML metadata: emoji and short_description

2eb1d70 3 months ago


metadata

title: FantasyTalking Demo
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
license: apache-2.0
short_description: Talking Portrait Generation Demo

FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

This is a Hugging Face Space demo for the FantasyTalking project, which generates realistic talking portraits from a single image and audio input.

🔥 Features

Single Image Input: Generate talking videos from just one portrait image
Audio-driven Animation: Synchronize lip movements with input audio
High Quality Output: 512x512 resolution with up to 81 frames
Controllable Generation: Adjust prompt and audio guidance scales

📋 Requirements

Because the models are large (~40GB+) and need substantial GPU memory, this Space only demonstrates the interface; full functionality requires a local deployment.

System Requirements

NVIDIA GPU with at least 5GB VRAM (low memory mode)
20GB+ VRAM recommended for optimal performance
50GB+ storage space for models
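
As a rough pre-flight check against the thresholds listed above, the sketch below compares free disk space and GPU memory with the 50GB, 5GB, and 20GB figures. The function names are hypothetical and not part of FantasyTalking; GPU detection assumes PyTorch is installed and falls back to `None` otherwise.

```python
import shutil

# Thresholds copied from the System Requirements list (in GB).
REQUIRED_STORAGE_GB = 50
MIN_VRAM_GB = 5           # low memory mode
RECOMMENDED_VRAM_GB = 20


def free_storage_gb(path="."):
    """Free disk space at `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def detect_vram_gb():
    """Total memory of GPU 0 in GB, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch  # imported lazily so the rest of the sketch runs without it
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1e9


def check_requirements(free_gb, vram_gb):
    """Return (errors, warnings) as two lists of human-readable strings."""
    errors, warnings = [], []
    if free_gb < REQUIRED_STORAGE_GB:
        errors.append(f"only {free_gb:.0f} GB free, need {REQUIRED_STORAGE_GB} GB+")
    if vram_gb is None:
        errors.append("no CUDA GPU detected")
    elif vram_gb < MIN_VRAM_GB:
        errors.append(f"{vram_gb:.0f} GB VRAM is below the {MIN_VRAM_GB} GB minimum")
    elif vram_gb < RECOMMENDED_VRAM_GB:
        warnings.append(f"{vram_gb:.0f} GB VRAM works in low memory mode only")
    return errors, warnings
```

Calling `check_requirements(free_storage_gb("."), detect_vram_gb())` from the directory that will hold `./models` gives a quick go/no-go before starting the downloads.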

🚀 Local Deployment

To run FantasyTalking locally with full functionality:

# 1. Clone the repository
git clone https://github.com/Fantasy-AMAP/fantasy-talking.git
cd fantasy-talking

# 2. Install dependencies
pip install -r requirements.txt
pip install flash_attn  # Optional, for accelerated attention computation

# 3. Download models
# Base model (~20GB)
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./models/Wan2.1-I2V-14B-720P

# Audio encoder (~1GB)
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./models/wav2vec2-base-960h

# FantasyTalking weights (~2GB)
huggingface-cli download acvlab/FantasyTalking fantasytalking_model.ckpt --local-dir ./models

# 4. Run inference
python infer.py --image_path ./assets/images/woman.png --audio_path ./assets/audios/woman.wav

# 5. Start web interface
python app.py
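
For scripted setups, step 3 can also be driven from Python. This is a minimal sketch, assuming the `huggingface_hub` package is installed; it mirrors the three `huggingface-cli download` commands above using `snapshot_download` and `hf_hub_download`. The `DOWNLOADS` list and the helper names are illustrative, not part of the FantasyTalking repository.

```python
from pathlib import Path

# (repo_id, single filename or None for the whole repo, target directory),
# taken from the huggingface-cli commands in step 3.
DOWNLOADS = [
    ("Wan-AI/Wan2.1-I2V-14B-720P", None, "./models/Wan2.1-I2V-14B-720P"),
    ("facebook/wav2vec2-base-960h", None, "./models/wav2vec2-base-960h"),
    ("acvlab/FantasyTalking", "fantasytalking_model.ckpt", "./models"),
]


def missing_targets(downloads=DOWNLOADS):
    """List target paths that do not exist yet.

    An existing directory is not proof of a complete download; this only
    catches the case where a step was skipped entirely.
    """
    missing = []
    for _repo_id, filename, local_dir in downloads:
        target = Path(local_dir) / filename if filename else Path(local_dir)
        if not target.exists():
            missing.append(str(target))
    return missing


def download_all(downloads=DOWNLOADS):
    """Fetch every entry; whole repos via snapshot_download, single files via hf_hub_download."""
    # Imported here so missing_targets() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download, snapshot_download

    for repo_id, filename, local_dir in downloads:
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        if filename is None:
            snapshot_download(repo_id=repo_id, local_dir=local_dir)
        else:
            hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)
```

Running `download_all()` from the `fantasy-talking` directory and then checking that `missing_targets()` returns an empty list is one way to confirm the layout before step 4.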

🎯 Performance

Model performance on a single A100 (512x512, 81 frames):

| torch_dtype | num_persistent_param_in_dit | Speed | Required VRAM |
| --- | --- | --- | --- |
| torch.bfloat16 | None (unlimited) | 15.5 s/it | 40 GB |
| torch.bfloat16 | 7×10⁹ (7B) | 32.8 s/it | 20 GB |
| torch.bfloat16 | 0 | 42.6 s/it | 5 GB |
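
The table implies a simple trade-off: the more parameters kept resident on the GPU, the faster each iteration and the more VRAM needed. The sketch below picks the fastest row that fits a given amount of VRAM and estimates run time. The helper names and the `num_steps` argument are hypothetical (the number of sampling iterations is not stated here), and the timings are the A100 figures above, so other GPUs will differ.

```python
# Rows from the table above, ordered fastest first:
# (num_persistent_param_in_dit, seconds per iteration, required VRAM in GB)
PROFILES = [
    (None, 15.5, 40),       # None means unlimited persistent parameters
    (7 * 10**9, 32.8, 20),
    (0, 42.6, 5),
]


def pick_profile(vram_gb):
    """Return (num_persistent_param_in_dit, sec_per_it) for the fastest row that fits."""
    for persistent, sec_per_it, required_gb in PROFILES:
        if vram_gb >= required_gb:
            return persistent, sec_per_it
    raise ValueError(f"{vram_gb} GB VRAM is below the 5 GB minimum in the table")


def estimated_minutes(sec_per_it, num_steps):
    """Rough wall-clock estimate; num_steps is an assumed iteration count."""
    return sec_per_it * num_steps / 60
```

For example, a 24 GB card falls into the 7×10⁹ row, so `pick_profile(24)` returns `(7000000000, 32.8)`.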

📖 Citation

@article{wang2025fantasytalking,
   title={FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis},
   author={Wang, Mengchao and Wang, Qiang and Jiang, Fan and Fan, Yaqi and Zhang, Yunpeng and Qi, Yonggang and Zhao, Kun and Xu, Mu},
   journal={arXiv preprint arXiv:2504.04842},
   year={2025}
}

🔗 Links

Paper: arXiv:2504.04842
Code: GitHub Repository (https://github.com/Fantasy-AMAP/fantasy-talking)
Models: Hugging Face (acvlab/FantasyTalking)
Project Page: FantasyTalking

📄 License

This project is licensed under the Apache-2.0 License.