metadata
title: HunyuanVideo-Foley
emoji: 🎵
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Generate realistic audio from video and text descriptions
HunyuanVideo-Foley
🎵 Text-Video-to-Audio Synthesis
Generate realistic audio from video and text descriptions using AI
About
HunyuanVideo-Foley is a multimodal diffusion model that generates high-quality sound effects (Foley audio) synchronized with video content. This Space hosts a working demo version: it demonstrates the interface and workflow but does not run the full model.
🎯 Working Demo Version
What this demo does:
- ✅ Full interface with all controls and settings
- ✅ Video upload and processing simulation
- ✅ Audio generation (synthetic demo tones)
- ✅ Multiple samples (up to 3 variations)
- ✅ Real-time feedback and status updates
What's different from the full version:
- 🎵 Generates synthetic audio instead of AI-generated Foley
- ⚡ Instant results (no 3-5 minute wait)
- 💾 Low memory usage (works within 16GB limit)
- Interface demonstration of the real model's capabilities
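As a rough illustration of what "synthetic demo tones" could look like, the sketch below writes a short sine tone to an in-memory WAV file at 48 kHz using only the Python standard library. The function name `synth_tone` and its parameters are hypothetical; this is not the code this Space actually runs.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 48_000  # matches the 48 kHz output rate listed under Technical Details


def synth_tone(duration_s: float, freq_hz: float = 440.0, amplitude: float = 0.3) -> bytes:
    """Return a mono 16-bit PCM WAV file (as bytes) containing a sine tone.

    Hypothetical placeholder generator, assumed for illustration only.
    """
    n_samples = int(round(duration_s * SAMPLE_RATE))
    # Short linear fade (at most 10 ms) at both ends to avoid audible clicks.
    fade = min(n_samples // 2, int(0.01 * SAMPLE_RATE))
    frames = bytearray()
    for i in range(n_samples):
        gain = 1.0
        if fade:
            gain = min(1.0, i / fade, (n_samples - 1 - i) / fade)
        sample = amplitude * gain * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
        frames += struct.pack("<h", int(sample * 32767))  # little-endian int16

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(frames))
    return buf.getvalue()
```

Calling `synth_tone(2.0, freq_hz=220.0)` would return the bytes of a two-second tone. Varying `freq_hz` per sample is one simple way a demo could produce several distinct variations.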
Full AI Model Access
For real AI-generated Foley audio:
- Run locally: Clone the GitHub repository
- 💻 Hardware needs: 24GB+ RAM, GPU recommended
- 📱 GPU Space: Upgrade to a paid GPU Space for cloud access
Features
- 🎬 Video-to-Audio: Generate audio effects from video content
- Text Guidance: Control generation with text descriptions
- 🎯 Multiple Samples: Generate up to 3 variations
- 🔧 Adjustable Settings: Control CFG scale and inference steps
- 📱 User-Friendly: Simple drag-and-drop interface
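To give a feel for what the CFG scale setting controls, here is a minimal sketch of the standard classifier-free guidance combination step. It assumes the common textbook formulation; the source does not describe how HunyuanVideo-Foley applies guidance internally, and the function name `apply_cfg` is hypothetical.

```python
def apply_cfg(uncond, cond, cfg_scale):
    """Combine unconditional and text-conditioned predictions.

    Standard classifier-free guidance: start from the unconditional
    prediction and move toward the conditioned one by `cfg_scale`.
    Inputs are plain lists of floats of equal length (an assumption
    made here to keep the sketch dependency-free).
    """
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]


# Scale 1.0 reproduces the conditioned prediction; larger values push
# the result further in the direction suggested by the text prompt.
guided = apply_cfg([0.0, 1.0], [1.0, 3.0], cfg_scale=2.0)
```

In a typical diffusion sampler this combination runs once per inference step, so the number of inference steps mainly trades generation time against output quality.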
How to Use
- Upload Video: Drag and drop your video file (MP4, AVI, MOV)
- Add Description (Optional): Describe the audio you want to generate
- Adjust Settings: Modify CFG scale and inference steps if needed
- Generate: Click "Generate Audio". This demo returns results instantly; the full model takes 3-5 minutes on CPU.
- Download: Save your generated audio/video combinations
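The checks implied by the steps above (accepted video formats, optional description, at most 3 samples) can be sketched as a small validation helper. The name `validate_request` and the returned dictionary layout are hypothetical and not taken from this Space's `app.py`.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp4", ".avi", ".mov"}  # formats listed in the upload step
MAX_SAMPLES = 3                                # the demo offers up to 3 variations


def validate_request(video_path: str, prompt: str = "", num_samples: int = 1) -> dict:
    """Validate user input before generation (hypothetical helper)."""
    ext = Path(video_path).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported video format: {ext or '(none)'}")
    if not 1 <= num_samples <= MAX_SAMPLES:
        raise ValueError(f"num_samples must be between 1 and {MAX_SAMPLES}")
    # The description is optional, so an empty prompt is accepted.
    return {
        "video_path": video_path,
        "prompt": prompt.strip(),
        "num_samples": num_samples,
    }
```

For example, `validate_request("clip.MP4", " rain on a tin roof ", 2)` passes, while a `.txt` path or a request for 4 samples raises `ValueError`.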
Tips for Best Results
- Video Length: Keep videos under 30 seconds for faster processing
- 🎯 Text Prompts: Use simple, clear descriptions
- ⚡ Settings: Lower values process faster on CPU
- Multiple Attempts: Try different settings if not satisfied
Technical Details
- Model: HunyuanVideo-Foley-XXL
- Architecture: Multimodal diffusion transformer
- Audio Quality: 48kHz professional-grade output
- Deployment: CPU-optimized for Hugging Face Spaces
Original Project
This is a CPU deployment of the original HunyuanVideo-Foley project:
- Paper: HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment
- 💻 GitHub: Tencent-Hunyuan/HunyuanVideo-Foley
- 🤗 Models: tencent/HunyuanVideo-Foley
Citation
@misc{shan2025hunyuanvideofoleymultimodaldiffusionrepresentation,
title={HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation},
author={Sizhe Shan and Qiulin Li and Yutao Cui and Miles Yang and Yuehai Wang and Qun Yang and Jin Zhou and Zhao Zhong},
year={2025},
eprint={2508.16930},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
License
This project is licensed under the Apache 2.0 License.
Powered by Tencent Hunyuan | Optimized for CPU deployment