title: HunyuanVideo-Foley
emoji: 🎵
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Generate realistic audio from video and text descriptions

HunyuanVideo-Foley

🎵 Text-Video-to-Audio Synthesis

Generate realistic audio from video and text descriptions using AI

About

HunyuanVideo-Foley is a multimodal diffusion model that generates high-quality sound effects (Foley audio) synchronized with video content. This Space hosts a working demo version: it demonstrates the interface and workflow, but produces synthetic tones rather than running the full model.

🎯 Working Demo Version

What this demo does:

  • ✅ Full interface with all controls and settings
  • ✅ Video upload and simulated processing
  • ✅ Audio generation (synthetic demo tones)
  • ✅ Multiple samples (up to 3 variations)
  • ✅ Real-time feedback and status updates

What's different from the full version:

  • 🎵 Generates synthetic audio instead of AI-generated Foley
  • ⚡ Instant results (no 3-5 minute wait)
  • 💾 Low memory usage (works within the 16 GB limit)
  • 🎭 Interface demonstration of the real model's capabilities
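
The synthetic demo tones could be produced along the following lines. This is a minimal sketch using only the Python standard library, not the Space's actual `app.py`: the function names, the 220 Hz base pitch, and the choice to give each variation a different pitch are assumptions made for illustration. The 48 kHz rate and the limit of 3 variations come from this README.

```python
import math
import struct
import wave

SAMPLE_RATE = 48_000  # 48 kHz, the output rate this README lists
MAX_SAMPLES = 3       # the interface offers up to 3 variations


def make_tone(freq_hz, duration_s, amplitude=0.3):
    """Return a mono sine tone as a list of floats in [-1.0, 1.0]."""
    n = int(SAMPLE_RATE * duration_s)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
        for i in range(n)
    ]


def make_variations(duration_s, num_samples=1, base_freq_hz=220.0):
    """Build 1 to MAX_SAMPLES tones, each at a different (hypothetical) pitch."""
    count = max(1, min(int(num_samples), MAX_SAMPLES))
    return [make_tone(base_freq_hz * (k + 1), duration_s) for k in range(count)]


def write_wav(path, samples):
    """Write float samples to disk as a 16-bit PCM mono WAV file."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)
```

For example, `write_wav("demo_0.wav", make_variations(2.0, num_samples=3)[0])` would save the first of three two-second tones.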

🚀 Full AI Model Access

For real AI-generated Foley audio:

  • 🏠 Run locally: Clone the GitHub repository
  • 💻 Hardware needs: 24 GB+ RAM, GPU recommended
  • 📱 GPU Space: Upgrade to a paid GPU Space for cloud access

Features

  • 🎬 Video-to-Audio: Generate audio effects from video content
  • 📝 Text Guidance: Control generation with text descriptions
  • 🎯 Multiple Samples: Generate up to 3 variations
  • 🔧 Adjustable Settings: Control CFG scale and inference steps
  • 📱 User-Friendly: Simple drag-and-drop interface
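
For readers unfamiliar with the CFG scale setting: classifier-free guidance is commonly implemented by blending a prompt-conditioned prediction with an unconditioned one at every inference step. The sketch below shows that standard blend on plain Python lists. The function name and inputs are hypothetical, and this README does not confirm that HunyuanVideo-Foley uses exactly this formulation.

```python
def apply_cfg(uncond, cond, cfg_scale):
    """Blend two model predictions with classifier-free guidance.

    cfg_scale = 0 returns the unconditioned prediction, 1 returns the
    conditioned one, and larger values push further toward the prompt.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy predictions standing in for real model outputs (assumed values).
unconditioned = [0.0, 1.0]
conditioned = [1.0, 3.0]
guided = apply_cfg(unconditioned, conditioned, cfg_scale=2.0)
```

In a diffusion sampler this blend runs once per inference step, so the step count, not the CFG scale, is what mainly drives run time.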

How to Use

  1. Upload Video: Drag and drop your video file (MP4, AVI, MOV)
  2. Add Description (Optional): Describe the audio you want to generate
  3. Adjust Settings: Modify CFG scale and inference steps if needed
  4. Generate: Click "Generate Audio". This demo returns synthetic audio almost instantly; the full model takes 3-5 minutes on CPU
  5. Download: Save your generated audio/video combinations
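
The input checks implied by steps 1 to 3 can be sketched as follows. This is an illustrative helper under stated assumptions, not code from the Space: `validate_request` and its return shape are hypothetical, while the accepted file extensions and the 3-sample limit come from this README.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp4", ".avi", ".mov"}  # formats listed in step 1
MAX_SAMPLES = 3                                # up to 3 variations


def validate_request(video_path, prompt="", num_samples=1):
    """Check an upload before generation and return a normalized request."""
    suffix = Path(video_path).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported video format: {suffix or 'none'}")
    if not 1 <= num_samples <= MAX_SAMPLES:
        raise ValueError(f"num_samples must be between 1 and {MAX_SAMPLES}")
    # The description is optional, so an empty prompt is allowed.
    return {
        "video": str(video_path),
        "prompt": prompt.strip(),
        "num_samples": num_samples,
    }
```

Rejecting bad input up front keeps the failure message close to the step that caused it, instead of surfacing later as a generation error.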

Tips for Best Results

  • 📏 Video Length: Keep videos under 30 seconds for faster processing
  • 🎯 Text Prompts: Use simple, clear descriptions
  • ⚡ Settings: Fewer inference steps process faster on CPU
  • 🔄 Multiple Attempts: Try different settings if you are not satisfied with the result

Technical Details

  • Model: HunyuanVideo-Foley-XXL
  • Architecture: Multimodal diffusion transformer
  • Audio Quality: 48 kHz professional-grade output
  • Deployment: CPU-optimized for Hugging Face Spaces

Original Project

This Space is a CPU deployment of the original HunyuanVideo-Foley project.

Citation

@misc{shan2025hunyuanvideofoleymultimodaldiffusionrepresentation,
      title={HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation}, 
      author={Sizhe Shan and Qiulin Li and Yutao Cui and Miles Yang and Yuehai Wang and Qun Yang and Jin Zhou and Zhao Zhong},
      year={2025},
      eprint={2508.16930},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

License

This project is licensed under the Apache 2.0 License.

🚀 Powered by Tencent Hunyuan | Optimized for CPU deployment