metadata

title: HunyuanVideo-Foley
emoji: 🎵
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Generate realistic audio from video and text descriptions

HunyuanVideo-Foley

🎵 Text-Video-to-Audio Synthesis

Generate realistic audio from video and text descriptions using AI

About

HunyuanVideo-Foley is a multimodal diffusion model that generates high-quality audio effects (Foley audio) synchronized with video content. This Space hosts a working demo version: it exposes the full interface and workflow, but it produces synthetic demo tones rather than running the model itself.

🎯 Working Demo Version

What this demo does:

✅ Full interface with all controls and settings
✅ Video upload and processing simulation
✅ Audio generation (synthetic demo tones)
✅ Multiple samples (up to 3 variations)
✅ Real-time feedback and status updates

What's different from the full version:

🎵 Generates synthetic audio instead of AI-generated Foley
⚡ Instant results (no 3-5 minute wait)
💾 Low memory usage (works within 16GB limit)
🎭 Interface demonstration of the real model's capabilities
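
The Space's app.py is not reproduced in this README, so the following is only a minimal sketch of how synthetic demo tones could be produced with the Python standard library. The function names, tone frequencies, fade length, and file prefix are illustrative assumptions, not the demo's actual code; the 48 kHz rate and the limit of 3 samples come from this README.

```python
import math
import struct
import wave

SAMPLE_RATE = 48_000  # 48 kHz, the output rate listed under Technical Details


def synth_tone(duration_s: float, freq_hz: float, amplitude: float = 0.3) -> bytes:
    """Return 16-bit mono PCM bytes for a sine tone with a short fade in and out."""
    n = int(duration_s * SAMPLE_RATE)
    # Roughly 10 ms fade (assumed value) so the tone does not start or end with a click.
    fade = max(1, min(n // 2, SAMPLE_RATE // 100))
    frames = bytearray()
    for i in range(n):
        envelope = min(1.0, i / fade, (n - 1 - i) / fade)
        sample = amplitude * envelope * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
        frames += struct.pack("<h", int(sample * 32767))
    return bytes(frames)


def write_demo_samples(duration_s: float, num_samples: int, prefix: str = "demo") -> list[str]:
    """Write 1 to 3 WAV files, one tone per variation, and return their paths."""
    if not 1 <= num_samples <= 3:
        raise ValueError("num_samples must be between 1 and 3")
    paths = []
    for k in range(num_samples):
        path = f"{prefix}_{k + 1}.wav"
        with wave.open(path, "wb") as wf:
            wf.setnchannels(1)           # mono
            wf.setsampwidth(2)           # 16-bit samples
            wf.setframerate(SAMPLE_RATE)
            # Each variation gets a different (arbitrary) pitch: 220, 440, 660 Hz.
            wf.writeframes(synth_tone(duration_s, 220.0 * (k + 1)))
        paths.append(path)
    return paths
```

In a sketch like this, duration_s would be set to the length of the uploaded video so each placeholder track lines up with the clip.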

🚀 Full AI Model Access

For real AI-generated Foley audio:

🏠 Run locally: Clone the GitHub repository
💻 Hardware needs: 24GB+ RAM, GPU recommended
📱 GPU Space: Upgrade to a paid GPU Space for cloud access
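
As a rough sketch of the local route, the snippet below builds the clone URL for the GitHub repository and fetches the model files with snapshot_download from the huggingface_hub package (installed separately). The repository names are the ones listed under Original Project; the helper names and the pretrained_models directory are assumptions, so check the repository's own instructions for the layout it expects.

```python
from pathlib import Path

GITHUB_REPO = "Tencent-Hunyuan/HunyuanVideo-Foley"  # GitHub repo listed under Original Project
MODEL_REPO = "tencent/HunyuanVideo-Foley"           # Hugging Face model repo listed there


def github_clone_url(repo: str = GITHUB_REPO) -> str:
    """Return the HTTPS URL to pass to `git clone`."""
    return f"https://github.com/{repo}.git"


def download_weights(local_dir: str = "pretrained_models") -> Path:
    """Download the model repo into local_dir (hypothetical directory name)."""
    # Imported lazily so the module can be loaded before huggingface_hub is installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=MODEL_REPO, local_dir=local_dir))


if __name__ == "__main__":
    print("Clone with: git clone", github_clone_url())
    print("Weights stored in:", download_weights())
```

Running the script performs a large download, so it is best done on the machine that meets the hardware needs listed above.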

Features

🎬 Video-to-Audio: Generate audio effects from video content
📝 Text Guidance: Control generation with text descriptions
🎯 Multiple Samples: Generate up to 3 variations
🔧 Adjustable Settings: Control CFG scale and inference steps
📱 User-Friendly: Simple drag-and-drop interface

How to Use

Upload Video: Drag and drop your video file (MP4, AVI, MOV)
Add Description (Optional): Describe the audio you want to generate
Adjust Settings: Modify CFG scale and inference steps if needed
Generate: Click "Generate Audio". This demo returns results almost instantly; the full model takes 3-5 minutes on CPU
Download: Save your generated audio/video combinations
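
The inputs gathered in the steps above can be pictured as a single request object that is checked before generation starts. This is a hedged sketch: the accepted file types and the 3-sample cap come from this README, while the class name, the default values, and the CFG and step ranges are hypothetical placeholders, since the Space's real slider limits are not documented here.

```python
from dataclasses import dataclass
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp4", ".avi", ".mov"}  # formats named in the upload step
MAX_SAMPLES = 3                                 # "up to 3 variations"

# Hypothetical bounds for illustration only; not taken from the Space.
CFG_RANGE = (1.0, 10.0)
STEPS_RANGE = (10, 100)


@dataclass
class FoleyRequest:
    video_path: str
    prompt: str = ""                # optional text description
    cfg_scale: float = 4.5          # hypothetical default
    num_inference_steps: int = 50   # hypothetical default
    num_samples: int = 1

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the request is usable."""
        problems = []
        if Path(self.video_path).suffix.lower() not in ALLOWED_EXTENSIONS:
            problems.append("video must be MP4, AVI, or MOV")
        if not CFG_RANGE[0] <= self.cfg_scale <= CFG_RANGE[1]:
            problems.append(f"cfg_scale must be within {CFG_RANGE}")
        if not STEPS_RANGE[0] <= self.num_inference_steps <= STEPS_RANGE[1]:
            problems.append(f"num_inference_steps must be within {STEPS_RANGE}")
        if not 1 <= self.num_samples <= MAX_SAMPLES:
            problems.append(f"num_samples must be between 1 and {MAX_SAMPLES}")
        return problems
```

For example, FoleyRequest("clip.mp4", prompt="footsteps on gravel").validate() returns an empty list, while a request for an .mkv file or for 5 samples returns one message per problem.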

Tips for Best Results

📏 Video Length: Keep videos under 30 seconds for faster processing
🎯 Text Prompts: Use simple, clear descriptions
⚡ Settings: Fewer inference steps process faster on CPU
🔄 Multiple Attempts: Try different settings if you are not satisfied with the result

Technical Details

Model: HunyuanVideo-Foley-XXL
Architecture: Multimodal diffusion transformer
Audio Quality: 48 kHz output
Deployment: CPU-optimized for Hugging Face Spaces

Original Project

This Space is a CPU-hosted demo of the original HunyuanVideo-Foley project:

📄 Paper: HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation
💻 GitHub: Tencent-Hunyuan/HunyuanVideo-Foley
🤗 Models: tencent/HunyuanVideo-Foley

Citation

@misc{shan2025hunyuanvideofoleymultimodaldiffusionrepresentation,
      title={HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation}, 
      author={Sizhe Shan and Qiulin Li and Yutao Cui and Miles Yang and Yuehai Wang and Qun Yang and Jin Zhou and Zhao Zhong},
      year={2025},
      eprint={2508.16930},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

License

This project is licensed under the Apache 2.0 License.

🚀 Powered by Tencent Hunyuan | Optimized for CPU deployment