wzy013 committed
Commit e78e3fd · 1 Parent(s): dfcf81e

Create working demo version that actually runs


- Replace app.py with working synthetic audio generator
- Minimal requirements.txt with only essential dependencies
- No large model loading - fits within 16GB memory limit
- Full interface functionality with demo audio generation
- Clear documentation of demo vs full version capabilities
- Instant audio generation for testing interface
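The new demo path swaps the diffusion pipeline for a small synthetic-tone generator. A minimal, self-contained sketch of that approach, condensed from the `create_demo_audio` function added in this commit (the `math` import and the default arguments here are additions for the sketch, not part of the committed file):

```python
import math
import os
import tempfile

import torch
import torchaudio


def create_demo_audio(text_prompt: str = "", duration: float = 5.0, sample_rate: int = 48000) -> str:
    """Write a placeholder sine tone to a temporary WAV file instead of running the Foley model."""
    t = torch.linspace(0, duration, int(duration * sample_rate))
    audio = 0.3 * torch.sin(2 * math.pi * 440 * t)  # base 440 Hz tone
    if text_prompt:
        # Vary the tone with the prompt length so different prompts sound different
        audio += 0.1 * torch.sin(2 * math.pi * len(text_prompt) * 10 * t)
    path = os.path.join(tempfile.mkdtemp(), "demo_audio.wav")
    torchaudio.save(path, audio.unsqueeze(0), sample_rate)  # mono WAV, shape (1, num_samples)
    return path
```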

Files changed (5)
  1. README.md +18 -15
  2. app.py +109 -291
  3. app_working.py +241 -0
  4. requirements.txt +4 -49
  5. requirements_simple_working.txt +7 -0
README.md CHANGED
@@ -20,26 +20,29 @@ short_description: Generate realistic audio from video and text descriptions
 
 ## About
 
- HunyuanVideo-Foley is a multimodal diffusion model that generates high-quality audio effects (Foley audio) synchronized with video content. This Space provides a **CPU-optimized** version for demonstration purposes.
-
- ### ⚠️ Memory Limitation Notice
-
- **Important**: This model requires >16GB RAM to load fully, but free CPU Spaces have a 16GB limit.
-
- **Current Status:**
- - ✅ **Dependencies installed** successfully
- - ✅ **Model downloaded** (13GB+ models available)
- - ❌ **Memory limit exceeded** during model loading
-
- **Workarounds:**
- - 🔄 **Demo mode** with limited functionality
- - 📱 **Upgrade to GPU Space** (recommended)
- - 🏠 **Run locally** with 24GB+ RAM
-
- **Free CPU Limitations:**
- - **Memory**: 16GB limit (model needs >16GB)
- - **Performance**: Very slow inference if loaded
- - **Concurrent users**: Severely limited
+ HunyuanVideo-Foley is a multimodal diffusion model that generates high-quality audio effects (Foley audio) synchronized with video content. This Space provides a **Working Demo Version** that demonstrates the interface and functionality.
+
+ ### 🎯 Working Demo Version
+
+ **What this demo does:**
+ - ✅ **Full interface** with all controls and settings
+ - ✅ **Video upload** and processing simulation
+ - ✅ **Audio generation** (synthetic demo tones)
+ - ✅ **Multiple samples** (up to 3 variations)
+ - ✅ **Real-time feedback** and status updates
+
+ **What's different from full version:**
+ - 🎵 **Generates synthetic audio** instead of AI-generated Foley
+ - ⚡ **Instant results** (no 3-5 minute wait)
+ - 💾 **Low memory usage** (works within 16GB limit)
+ - 🎭 **Interface demonstration** of the real model's capabilities
+
+ ### 🚀 Full AI Model Access
+
+ For **real AI-generated Foley audio**:
+ - 🏠 **Run locally**: Clone the [GitHub repository](https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley)
+ - 💻 **Hardware needs**: 24GB+ RAM, GPU recommended
+ - 📱 **GPU Space**: Upgrade to paid GPU Space for cloud access
 
 ## Features
app.py CHANGED
@@ -7,300 +7,150 @@ from loguru import logger
  from typing import Optional, Tuple
  import random
  import numpy as np
- import gc
-
- # Force CPU usage and memory optimization for Hugging Face Spaces
- os.environ["CUDA_VISIBLE_DEVICES"] = ""
- os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
-
- # Memory optimization settings
- torch.set_num_threads(1)  # Reduce thread count for memory
- torch.set_num_interop_threads(1)
-
- from hunyuanvideo_foley.utils.model_utils import load_model
- from hunyuanvideo_foley.utils.feature_utils import feature_process
- from hunyuanvideo_foley.utils.model_utils import denoise_process
- from hunyuanvideo_foley.utils.media_utils import merge_audio_video
-
- # Global variables for model storage
- model_dict = None
- cfg = None
- device = None
-
- # Model path for Hugging Face Spaces - try to download automatically
- MODEL_PATH = os.environ.get("HIFI_FOLEY_MODEL_PATH", "./pretrained_models/")
- CONFIG_PATH = "configs/hunyuanvideo-foley-xxl.yaml"
-
- def setup_device(force_cpu: bool = True) -> torch.device:
-     """Setup computing device - force CPU for Hugging Face Spaces"""
-     if force_cpu:
-         device = torch.device("cpu")
-         logger.info("Using CPU device (forced for Hugging Face Spaces)")
-     else:
-         if torch.cuda.is_available():
-             device = torch.device("cuda:0")
-             logger.info("Using CUDA device")
-         elif torch.backends.mps.is_available():
-             device = torch.device("mps")
-             logger.info("Using MPS device")
-         else:
-             device = torch.device("cpu")
-             logger.info("Using CPU device")
-
-     return device
-
- def download_models():
-     """Download models from Hugging Face if not present"""
-     try:
-         from huggingface_hub import snapshot_download
-         logger.info("Downloading models from Hugging Face...")
-
-         # Download the model files
-         snapshot_download(
-             repo_id="tencent/HunyuanVideo-Foley",
-             local_dir="./pretrained_models",
-             local_dir_use_symlinks=False
-         )
-
-         logger.info("Model download completed!")
-         return True
-     except Exception as e:
-         logger.error(f"Failed to download models: {str(e)}")
-         return False
-
- def auto_load_models() -> str:
-     """Load models with memory optimization for 16GB limit"""
-     global model_dict, cfg, device
-
-     try:
-         # First try to download models if they don't exist
-         if not os.path.exists(MODEL_PATH) or not os.listdir(MODEL_PATH):
-             logger.info("Models not found locally, attempting to download...")
-             if not download_models():
-                 return "❌ Failed to download models from Hugging Face"
-
-         if not os.path.exists(CONFIG_PATH):
-             return f"❌ Config file not found: {CONFIG_PATH}"
-
-         # Force CPU usage for Hugging Face Spaces
-         device = setup_device(force_cpu=True)
-
-         # Memory optimization before loading
-         logger.info("Optimizing memory before model loading...")
-         gc.collect()  # Force garbage collection
-
-         # Load model with aggressive memory optimization
-         logger.info("Loading model on CPU with memory optimization...")
-         logger.info(f"Model path: {MODEL_PATH}")
-         logger.info(f"Config path: {CONFIG_PATH}")
-
-         # Try loading with CPU offloading
-         try:
-             model_dict, cfg = load_model(MODEL_PATH, CONFIG_PATH, device)
-             logger.info("✅ Model loaded successfully on CPU!")
-             return "✅ Model loaded successfully on CPU!"
-         except RuntimeError as e:
-             if "out of memory" in str(e).lower() or "memory" in str(e).lower():
-                 logger.warning("Initial load failed due to memory constraints, trying alternative approach...")
-                 # Clear any partial loads
-                 gc.collect()
-
-                 # Return a demo mode message
-                 return "⚠️ Demo mode: Model too large for free CPU (16GB limit). Consider upgrading to GPU Space for full functionality."
-             else:
-                 raise e
-
-     except Exception as e:
-         logger.error(f"Model loading failed: {str(e)}")
-         return f"❌ Model loading failed: {str(e)}"
-
- def infer_single_video(
-     video_file,
-     text_prompt: str,
-     guidance_scale: float = 2.0,  # Lower for CPU
-     num_inference_steps: int = 20,  # Reduced for CPU
-     sample_nums: int = 1
- ) -> Tuple[list, str]:
-     """Single video inference optimized for CPU"""
-     global model_dict, cfg, device
-
-     if model_dict is None or cfg is None:
-         return [], "❌ Please load the model first!"
+ import requests
+ import json
+
+ # Simplified working version without loading large models
+
+ def create_demo_audio(video_file, text_prompt: str, duration: float = 5.0) -> str:
+     """Create a simple demo audio file"""
+     sample_rate = 48000
+     duration_samples = int(duration * sample_rate)
+
+     # Generate a simple tone as demo
+     t = torch.linspace(0, duration, duration_samples)
+     frequency = 440  # A note
+     audio = 0.3 * torch.sin(2 * 3.14159 * frequency * t)
+
+     # Add some variation based on text prompt length
+     if text_prompt:
+         freq_mod = len(text_prompt) * 10
+         audio += 0.1 * torch.sin(2 * 3.14159 * freq_mod * t)
+
+     # Save to temporary file
+     temp_dir = tempfile.mkdtemp()
+     audio_path = os.path.join(temp_dir, "demo_audio.wav")
+     torchaudio.save(audio_path, audio.unsqueeze(0), sample_rate)
+
+     return audio_path
+
+ def process_video_demo(video_file, text_prompt: str, guidance_scale: float, inference_steps: int, sample_nums: int) -> Tuple[list, str]:
+     """Working demo version that generates simple audio"""
 
      if video_file is None:
          return [], "❌ Please upload a video file!"
 
-     # Allow empty text prompt
      if text_prompt is None:
          text_prompt = ""
-     text_prompt = text_prompt.strip()
 
      try:
-         logger.info(f"Processing video: {video_file}")
+         logger.info(f"Processing video in demo mode: {video_file}")
          logger.info(f"Text prompt: {text_prompt}")
-         logger.info("Running inference on CPU (this may take a while)...")
-
-         # Feature processing
-         visual_feats, text_feats, audio_len_in_s = feature_process(
-             video_file,
-             text_prompt,
-             model_dict,
-             cfg
-         )
-
-         # Denoising process with CPU-optimized settings
-         logger.info(f"Generating {sample_nums} audio sample(s) on CPU...")
-         audio, sample_rate = denoise_process(
-             visual_feats,
-             text_feats,
-             audio_len_in_s,
-             model_dict,
-             cfg,
-             guidance_scale=guidance_scale,
-             num_inference_steps=num_inference_steps,
-             batch_size=sample_nums
-         )
-
-         # Create temporary files to save results
-         temp_dir = tempfile.mkdtemp()
+
+         # Generate simple demo audio
          video_outputs = []
-
-         # Process each generated audio sample
-         for i in range(sample_nums):
-             # Save audio file
-             audio_output = os.path.join(temp_dir, f"generated_audio_{i+1}.wav")
-             torchaudio.save(audio_output, audio[i], sample_rate)
-
-             # Merge video and audio
-             video_output = os.path.join(temp_dir, f"video_with_audio_{i+1}.mp4")
-             merge_audio_video(audio_output, video_file, video_output)
-             video_outputs.append(video_output)
-
-         logger.info(f"Inference completed! Generated {sample_nums} samples.")
-         return video_outputs, f"✅ Generated {sample_nums} audio sample(s) successfully on CPU!"
+         for i in range(min(sample_nums, 3)):  # Limit to 3 samples
+             demo_audio = create_demo_audio(video_file, f"{text_prompt}_sample_{i+1}")
+
+             # For demo, just return the audio file path
+             # In a real implementation, this would be merged with video
+             video_outputs.append(demo_audio)
+
+         success_msg = f"""✅ Demo Generation Complete!
+
+ 📹 **Processed**: {os.path.basename(video_file) if hasattr(video_file, 'name') else 'Video file'}
+ 📝 **Prompt**: "{text_prompt}"
+ ⚙️ **Settings**: CFG={guidance_scale}, Steps={inference_steps}, Samples={sample_nums}
+
+ 🎵 **Generated**: {len(video_outputs)} demo audio sample(s)
+
+ ⚠️ **Note**: This is a working demo with synthetic audio.
+ For real AI-generated Foley audio, run locally with the full model:
+ https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley"""
+
+         return video_outputs, success_msg
 
      except Exception as e:
-         logger.error(f"Inference failed: {str(e)}")
-         return [], f"❌ Inference failed: {str(e)}"
-
- def update_video_outputs(video_list, status_msg):
-     """Update video outputs based on the number of generated samples"""
-     # Initialize all outputs as None
-     outputs = [None] * 3  # Reduced to 3 for CPU
-
-     # Set values based on generated videos
-     for i, video_path in enumerate(video_list[:3]):  # Max 3 samples for CPU
-         outputs[i] = video_path
-
-     # Return all outputs plus status message
-     return tuple(outputs + [status_msg])
-
- def create_gradio_interface():
-     """Create Gradio interface optimized for CPU deployment"""
-
-     # Custom CSS with Hugging Face Spaces styling
+         logger.error(f"Demo processing failed: {str(e)}")
+         return [], f"❌ Demo processing failed: {str(e)}"
+
+ def create_working_interface():
+     """Create a working Gradio interface"""
 
      css = """
      .gradio-container {
-         font-family: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
+         font-family: 'Inter', sans-serif;
          background: linear-gradient(135deg, #f5f7fa 0%, #c3cfe2 100%);
-         min-height: 100vh;
      }
 
      .main-header {
          text-align: center;
-         padding: 2rem 0;
+         padding: 2rem;
          background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
          border-radius: 20px;
          margin-bottom: 2rem;
-         box-shadow: 0 8px 32px rgba(0,0,0,0.15);
-     }
-
-     .main-header h1 {
          color: white;
-         font-size: 3rem;
-         font-weight: 700;
-         margin-bottom: 0.5rem;
-         text-shadow: 0 2px 10px rgba(0,0,0,0.3);
      }
 
-     .main-header p {
-         color: rgba(255, 255, 255, 0.95);
-         font-size: 1.2rem;
-         font-weight: 300;
-     }
-
-     .cpu-notice {
-         background: #fff3cd;
-         border: 1px solid #ffeaa7;
+     .demo-notice {
+         background: #e8f4fd;
+         border: 2px solid #1890ff;
          border-radius: 10px;
          padding: 1rem;
          margin: 1rem 0;
-         color: #856404;
+         color: #0050b3;
      }
      """
 
-     with gr.Blocks(css=css, title="HunyuanVideo-Foley (CPU)") as app:
+     with gr.Blocks(css=css, title="HunyuanVideo-Foley Demo") as app:
 
-         # Main header
+         # Header
          with gr.Column(elem_classes=["main-header"]):
              gr.HTML("""
                  <h1>🎵 HunyuanVideo-Foley</h1>
-                 <p>Text-Video-to-Audio Synthesis (CPU Version)</p>
+                 <p>Working Demo Version</p>
              """)
 
-         # CPU Notice
+         # Demo Notice
          gr.HTML("""
-             <div class="cpu-notice">
-                 <strong>⚠️ CPU Deployment Notice:</strong> This Space runs on CPU which means inference will be slower than GPU version.
-                 Each generation may take 3-5 minutes. For faster inference, consider running locally with GPU.
+             <div class="demo-notice">
+                 <strong>🎯 Working Demo:</strong> This version generates synthetic audio to demonstrate the interface.
+                 Upload a video and try the controls to see how it works!<br>
+                 <strong>For real AI audio:</strong> Visit the <a href="https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley" target="_blank">original repository</a>
              </div>
          """)
 
-         # Usage Guide
-         gr.Markdown("""
-         ### 📋 Quick Start Guide
-         **1.** Upload your video file **2.** Add optional text description **3.** Click Generate Audio (be patient!)
-
-         💡 **Tips for CPU usage:**
-         - Use shorter videos (< 30 seconds recommended)
-         - Simple text prompts work better
-         - Expect longer processing times
-         """)
-
-         # Main interface
          with gr.Row():
-             # Input section
+             # Input Section
              with gr.Column(scale=1):
                  gr.Markdown("### 📹 Video Input")
 
                  video_input = gr.Video(
                      label="Upload Video",
-                     info="Supported formats: MP4, AVI, MOV, etc. Shorter videos recommended for CPU.",
-                     height=300
+                     info="Upload any video file to test the interface"
                  )
 
                  text_input = gr.Textbox(
-                     label="🎯 Audio Description (English)",
-                     placeholder="A person walks on frozen ice",
-                     lines=3,
-                     info="Describe the audio you want to generate (optional)"
+                     label="🎯 Audio Description",
+                     placeholder="Describe the audio you want (affects demo tone)",
+                     lines=3
                  )
 
                  with gr.Row():
                      guidance_scale = gr.Slider(
                          minimum=1.0,
-                         maximum=5.0,
-                         value=2.0,
+                         maximum=10.0,
+                         value=4.0,
                          step=0.1,
-                         label="🎚️ CFG Scale (lower for CPU)",
+                         label="🎚️ CFG Scale"
                      )
 
                      inference_steps = gr.Slider(
                          minimum=10,
-                         maximum=50,
-                         value=20,
+                         maximum=100,
+                         value=50,
                          step=5,
-                         label="⚡ Steps (reduced for CPU)",
+                         label="⚡ Steps"
                      )
 
                  sample_nums = gr.Slider(
@@ -308,115 +158,83 @@ def create_gradio_interface():
                      maximum=3,
                      value=1,
                      step=1,
-                     label="🎲 Sample Nums (max 3 for CPU)",
+                     label="🎲 Samples"
                  )
 
-                 generate_btn = gr.Button(
-                     "🎵 Generate Audio (CPU)",
-                     variant="primary"
-                 )
+                 generate_btn = gr.Button("🎵 Generate Demo Audio", variant="primary")
 
-             # Results section
+             # Output Section
              with gr.Column(scale=1):
-                 gr.Markdown("### 🎥 Generated Results")
-
-                 # Reduced number of outputs for CPU
-                 video_output_1 = gr.Video(
-                     label="Sample 1",
-                     height=250,
-                     visible=True
-                 )
-
-                 with gr.Row():
-                     video_output_2 = gr.Video(
-                         label="Sample 2",
-                         height=200,
-                         visible=False
-                     )
-                     video_output_3 = gr.Video(
-                         label="Sample 3",
-                         height=200,
-                         visible=False
-                     )
-
-                 result_text = gr.Textbox(
+                 gr.Markdown("### 🎵 Generated Audio")
+
+                 audio_output_1 = gr.Audio(label="Sample 1", visible=True)
+                 audio_output_2 = gr.Audio(label="Sample 2", visible=False)
+                 audio_output_3 = gr.Audio(label="Sample 3", visible=False)
+
+                 status_output = gr.Textbox(
                      label="Status",
                      interactive=False,
-                     lines=3
+                     lines=6
                  )
 
          # Event handlers
-         def process_inference(video_file, text_prompt, guidance_scale, inference_steps, sample_nums):
-             # Generate videos
-             video_list, status_msg = infer_single_video(
-                 video_file, text_prompt, guidance_scale, inference_steps, int(sample_nums)
-             )
-             # Update outputs with proper visibility
-             return update_video_outputs(video_list, status_msg)
-
-         # Add dynamic visibility control
          def update_visibility(sample_nums):
-             sample_nums = int(sample_nums)
              return [
                  gr.update(visible=True),  # Sample 1 always visible
-                 gr.update(visible=sample_nums >= 2),  # Sample 2
-                 gr.update(visible=sample_nums >= 3),  # Sample 3
+                 gr.update(visible=sample_nums >= 2),
+                 gr.update(visible=sample_nums >= 3)
              ]
 
-         # Update visibility when sample_nums changes
+         def process_demo(video_file, text_prompt, guidance_scale, inference_steps, sample_nums):
+             audio_files, status_msg = process_video_demo(
+                 video_file, text_prompt, guidance_scale, inference_steps, int(sample_nums)
+             )
+
+             # Prepare outputs
+             outputs = [None, None, None]
+             for i, audio_file in enumerate(audio_files[:3]):
+                 outputs[i] = audio_file
+
+             return outputs[0], outputs[1], outputs[2], status_msg
+
+         # Connect events
          sample_nums.change(
              fn=update_visibility,
              inputs=[sample_nums],
-             outputs=[video_output_1, video_output_2, video_output_3]
+             outputs=[audio_output_1, audio_output_2, audio_output_3]
          )
 
          generate_btn.click(
-             fn=process_inference,
+             fn=process_demo,
              inputs=[video_input, text_input, guidance_scale, inference_steps, sample_nums],
-             outputs=[
-                 video_output_1,
-                 video_output_2,
-                 video_output_3,
-                 result_text
-             ]
+             outputs=[audio_output_1, audio_output_2, audio_output_3, status_output]
          )
 
          # Footer
          gr.HTML("""
              <div style="text-align: center; padding: 2rem; color: #666;">
-                 <p>🚀 Powered by HunyuanVideo-Foley | Running on CPU for Hugging Face Spaces</p>
-                 <p>For faster inference, visit the <a href="https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley" target="_blank">original repository</a></p>
+                 <p>🎭 <strong>Demo Version:</strong> Generates synthetic audio for interface demonstration</p>
+                 <p>🚀 <strong>Full Version:</strong> <a href="https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley" target="_blank">GitHub Repository</a></p>
              </div>
          """)
 
      return app
 
- def set_manual_seed(global_seed):
-     random.seed(global_seed)
-     np.random.seed(global_seed)
-     torch.manual_seed(global_seed)
-
  if __name__ == "__main__":
-     set_manual_seed(1)
      # Setup logging
      logger.remove()
      logger.add(lambda msg: print(msg, end=''), level="INFO")
 
-     # Auto-load model
-     logger.info("Starting CPU application and loading model...")
-     model_load_result = auto_load_models()
-     logger.info(model_load_result)
+     logger.info("Starting HunyuanVideo-Foley Working Demo...")
 
-     # Create and launch Gradio app
-     app = create_gradio_interface()
+     # Create and launch app
+     app = create_working_interface()
 
-     # Log completion status
-     if "successfully" in model_load_result:
-         logger.info("Application ready, model loaded on CPU")
+     logger.info("Demo app ready - will generate synthetic audio for testing")
 
      app.launch(
          server_name="0.0.0.0",
-         server_port=7860,  # Standard port for Hugging Face Spaces
+         server_port=7860,
          share=False,
          debug=False,
          show_error=True
app_working.py ADDED
@@ -0,0 +1,241 @@
+ import os
+ import tempfile
+ import gradio as gr
+ import torch
+ import torchaudio
+ from loguru import logger
+ from typing import Optional, Tuple
+ import random
+ import numpy as np
+ import requests
+ import json
+
+ # Simplified working version without loading large models
+
+ def create_demo_audio(video_file, text_prompt: str, duration: float = 5.0) -> str:
+     """Create a simple demo audio file"""
+     sample_rate = 48000
+     duration_samples = int(duration * sample_rate)
+
+     # Generate a simple tone as demo
+     t = torch.linspace(0, duration, duration_samples)
+     frequency = 440  # A note
+     audio = 0.3 * torch.sin(2 * 3.14159 * frequency * t)
+
+     # Add some variation based on text prompt length
+     if text_prompt:
+         freq_mod = len(text_prompt) * 10
+         audio += 0.1 * torch.sin(2 * 3.14159 * freq_mod * t)
+
+     # Save to temporary file
+     temp_dir = tempfile.mkdtemp()
+     audio_path = os.path.join(temp_dir, "demo_audio.wav")
+     torchaudio.save(audio_path, audio.unsqueeze(0), sample_rate)
+
+     return audio_path
+
+ def process_video_demo(video_file, text_prompt: str, guidance_scale: float, inference_steps: int, sample_nums: int) -> Tuple[list, str]:
+     """Working demo version that generates simple audio"""
+
+     if video_file is None:
+         return [], "❌ Please upload a video file!"
+
+     if text_prompt is None:
+         text_prompt = ""
+
+     try:
+         logger.info(f"Processing video in demo mode: {video_file}")
+         logger.info(f"Text prompt: {text_prompt}")
+
+         # Generate simple demo audio
+         video_outputs = []
+         for i in range(min(sample_nums, 3)):  # Limit to 3 samples
+             demo_audio = create_demo_audio(video_file, f"{text_prompt}_sample_{i+1}")
+
+             # For demo, just return the audio file path
+             # In a real implementation, this would be merged with video
+             video_outputs.append(demo_audio)
+
+         success_msg = f"""✅ Demo Generation Complete!
+
+ 📹 **Processed**: {os.path.basename(video_file) if hasattr(video_file, 'name') else 'Video file'}
+ 📝 **Prompt**: "{text_prompt}"
+ ⚙️ **Settings**: CFG={guidance_scale}, Steps={inference_steps}, Samples={sample_nums}
+
+ 🎵 **Generated**: {len(video_outputs)} demo audio sample(s)
+
+ ⚠️ **Note**: This is a working demo with synthetic audio.
+ For real AI-generated Foley audio, run locally with the full model:
+ https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley"""
+
+         return video_outputs, success_msg
+
+     except Exception as e:
+         logger.error(f"Demo processing failed: {str(e)}")
+         return [], f"❌ Demo processing failed: {str(e)}"
+
+ def create_working_interface():
+     """Create a working Gradio interface"""
+
+     css = """
+     .gradio-container {
+         font-family: 'Inter', sans-serif;
+         background: linear-gradient(135deg, #f5f7fa 0%, #c3cfe2 100%);
+     }
+
+     .main-header {
+         text-align: center;
+         padding: 2rem;
+         background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+         border-radius: 20px;
+         margin-bottom: 2rem;
+         color: white;
+     }
+
+     .demo-notice {
+         background: #e8f4fd;
+         border: 2px solid #1890ff;
+         border-radius: 10px;
+         padding: 1rem;
+         margin: 1rem 0;
+         color: #0050b3;
+     }
+     """
+
+     with gr.Blocks(css=css, title="HunyuanVideo-Foley Demo") as app:
+
+         # Header
+         with gr.Column(elem_classes=["main-header"]):
+             gr.HTML("""
+                 <h1>🎵 HunyuanVideo-Foley</h1>
+                 <p>Working Demo Version</p>
+             """)
+
+         # Demo Notice
+         gr.HTML("""
+             <div class="demo-notice">
+                 <strong>🎯 Working Demo:</strong> This version generates synthetic audio to demonstrate the interface.
+                 Upload a video and try the controls to see how it works!<br>
+                 <strong>For real AI audio:</strong> Visit the <a href="https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley" target="_blank">original repository</a>
+             </div>
+         """)
+
+         with gr.Row():
+             # Input Section
+             with gr.Column(scale=1):
+                 gr.Markdown("### 📹 Video Input")
+
+                 video_input = gr.Video(
+                     label="Upload Video",
+                     info="Upload any video file to test the interface"
+                 )
+
+                 text_input = gr.Textbox(
+                     label="🎯 Audio Description",
+                     placeholder="Describe the audio you want (affects demo tone)",
+                     lines=3
+                 )
+
+                 with gr.Row():
+                     guidance_scale = gr.Slider(
+                         minimum=1.0,
+                         maximum=10.0,
+                         value=4.0,
+                         step=0.1,
+                         label="🎚️ CFG Scale"
+                     )
+
+                     inference_steps = gr.Slider(
+                         minimum=10,
+                         maximum=100,
+                         value=50,
+                         step=5,
+                         label="⚡ Steps"
+                     )
+
+                 sample_nums = gr.Slider(
+                     minimum=1,
+                     maximum=3,
+                     value=1,
+                     step=1,
+                     label="🎲 Samples"
+                 )
+
+                 generate_btn = gr.Button("🎵 Generate Demo Audio", variant="primary")
+
+             # Output Section
+             with gr.Column(scale=1):
+                 gr.Markdown("### 🎵 Generated Audio")
+
+                 audio_output_1 = gr.Audio(label="Sample 1", visible=True)
+                 audio_output_2 = gr.Audio(label="Sample 2", visible=False)
+                 audio_output_3 = gr.Audio(label="Sample 3", visible=False)
+
+                 status_output = gr.Textbox(
+                     label="Status",
+                     interactive=False,
+                     lines=6
+                 )
+
+         # Event handlers
+         def update_visibility(sample_nums):
+             return [
+                 gr.update(visible=True),  # Sample 1 always visible
+                 gr.update(visible=sample_nums >= 2),
+                 gr.update(visible=sample_nums >= 3)
+             ]
+
+         def process_demo(video_file, text_prompt, guidance_scale, inference_steps, sample_nums):
+             audio_files, status_msg = process_video_demo(
+                 video_file, text_prompt, guidance_scale, inference_steps, int(sample_nums)
+             )
+
+             # Prepare outputs
+             outputs = [None, None, None]
+             for i, audio_file in enumerate(audio_files[:3]):
+                 outputs[i] = audio_file
+
+             return outputs[0], outputs[1], outputs[2], status_msg
+
+         # Connect events
+         sample_nums.change(
+             fn=update_visibility,
+             inputs=[sample_nums],
+             outputs=[audio_output_1, audio_output_2, audio_output_3]
+         )
+
+         generate_btn.click(
+             fn=process_demo,
+             inputs=[video_input, text_input, guidance_scale, inference_steps, sample_nums],
+             outputs=[audio_output_1, audio_output_2, audio_output_3, status_output]
+         )
+
+         # Footer
+         gr.HTML("""
+             <div style="text-align: center; padding: 2rem; color: #666;">
+                 <p>🎭 <strong>Demo Version:</strong> Generates synthetic audio for interface demonstration</p>
+                 <p>🚀 <strong>Full Version:</strong> <a href="https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley" target="_blank">GitHub Repository</a></p>
+             </div>
+         """)
+
+     return app
+
+ if __name__ == "__main__":
+     # Setup logging
+     logger.remove()
+     logger.add(lambda msg: print(msg, end=''), level="INFO")
+
+     logger.info("Starting HunyuanVideo-Foley Working Demo...")
+
+     # Create and launch app
+     app = create_working_interface()
+
+     logger.info("Demo app ready - will generate synthetic audio for testing")
+
+     app.launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         share=False,
+         debug=False,
+         show_error=True
+     )
requirements.txt CHANGED
@@ -1,52 +1,7 @@
- # Core ML dependencies
+ # Minimal requirements for working demo version
  torch>=2.0.0
- torchvision>=0.15.0
  torchaudio>=2.0.0
- numpy==1.26.4
- scipy
-
- # Deep Learning frameworks
- diffusers
- timm
- accelerate
-
- # Transformers and NLP
- transformers>=4.35.0,<4.50.0
- sentencepiece
-
- # Audio processing
- git+https://github.com/descriptinc/audiotools
-
- # Video/Image processing
- pillow
- av
- einops
-
- # Configuration and utilities
- pyyaml
- omegaconf
- easydict
- loguru
- tqdm
- setuptools
-
- # Data handling
- pandas
- pyarrow
-
- # Web interface - update for compatibility
+ numpy>=1.21.0
  gradio>=4.0.0
-
- # Network
- urllib3>=1.26.0
-
- # Hugging Face integration
- huggingface_hub>=0.16.0
- datasets
-
- # Additional dependencies for stability
- packaging
- typing-extensions
-
- # Optional: reduce memory usage
- psutil
+ loguru>=0.6.0
+ requests>=2.25.0
requirements_simple_working.txt ADDED
@@ -0,0 +1,7 @@
+ # Minimal requirements for working demo version
+ torch>=2.0.0
+ torchaudio>=2.0.0
+ numpy>=1.21.0
+ gradio>=4.0.0
+ loguru>=0.6.0
+ requests>=2.25.0