Spaces: Build error
Commit · fbce578
Parent(s): 538da63

Configure LLaMA-Omni2 0.5B without a GPT-2 fallback and prepare for deployment on Hugging Face

Browse files:
- Dockerfile +30 -0
- README.md +69 -1
- app.py +132 -245
- app.yaml +16 -0
- requirements.txt +11 -9
Dockerfile
ADDED
@@ -0,0 +1,30 @@
+FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
+
+WORKDIR /app
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    git \
+    wget \
+    ffmpeg \
+    libsndfile1 \
+    && rm -rf /var/lib/apt/lists/*
+
+# Copy the source files
+COPY . .
+
+# Prepare the models directory
+RUN mkdir -p models
+
+# Install Python requirements
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Expose the Gradio port
+EXPOSE 7860
+
+# Set environment variables
+ENV PYTHONUNBUFFERED=1
+ENV MODELS_DIR=/app/models
+
+# Command to start the server
+CMD ["python", "app.py"]
README.md
CHANGED
@@ -12,4 +12,72 @@ pinned: false
 # Ex: hardware: nvidia-t4
 ---
 
-
+# LLaMA-Omni2 + Whisper Demo
+
+A demo application that combines Whisper speech recognition with LLaMA-Omni2 0.5B text and speech generation.
+
+## About the Project
+
+This application demonstrates the LLaMA-Omni2 0.5B model's ability to process spoken instructions and generate responses as both text and speech, all at low latency. The modular architecture is based on research from the Institute of Computing Technology at the Chinese Academy of Sciences.
+
+## Key Features
+
+- 🎤 **Speech Recognition**: OpenAI Whisper-tiny for audio transcription
+- 💬 **Text Generation**: The LLaMA-Omni2 model for generating text responses
+- 🔊 **Speech Synthesis**: Speech generated from the text responses (when available)
+- 🔄 **Full Pipeline**: An integrated audio → text → response → speech flow
+
+## How to Use
+
+The Gradio interface offers three interaction modes:
+
+1. **Full Pipeline**: Upload an audio file; it is transcribed and used to generate a text/speech response
+2. **Speech Recognition**: Test only Whisper's transcription capability
+3. **Text/Speech Generation**: Provide your own text and generate a response
+
+## LLaMA-Omni2 Architecture
+
+LLaMA-Omni2 is a speech-language model built from four main components:
+
+1. **Speech Encoder**: Based on Whisper-large-v3; converts speech input into acoustic representations
+2. **Speech Adapter**: Bridges the acoustic and textual spaces
+3. **LLM Core**: The "reasoning engine", based on Qwen2.5-Instruct
+4. **Streaming TTS Decoder**: Converts text tokens into speech as they stream
+
+## Local Setup
+
+If you want to run this application locally:
+
+```bash
+# Clone the repository
+git clone https://github.com/seu-usuario/llama-omni-demo
+cd llama-omni-demo
+
+# Install the dependencies
+pip install -r requirements.txt
+
+# Run the application
+python app.py
+```
+
+## Requirements
+
+- Python 3.10+
+- A CUDA-compatible GPU, or a CPU with at least 8 GB of RAM
+- The dependencies listed in requirements.txt
+
+## Current Limitations
+
+- LLaMA-Omni2 is an experimental model and may generate incorrect or inaccurate responses
+- Speech generation may be unavailable if the model did not load correctly
+- Significant computational resources are required for it to run well
+
+## References
+
+- [LLaMA-Omni2 repository](https://github.com/ictnlp/LLaMA-Omni2)
+- [OpenAI Whisper](https://github.com/openai/whisper)
+- [LLaMA-Omni2 paper](https://arxiv.org/abs/2505.02625)
+
+## License
+
+This project is licensed under the Apache License 2.0.
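The pipeline the README describes (audio → text → response, with speech synthesis as a third stage) can be condensed to a few lines. Below is a minimal sketch using the same Hugging Face `pipeline` API that app.py builds on; whether the ICTNLP/LLaMA-Omni2-0.5B checkpoint actually loads through the plain text-generation pipeline is exactly what app.py tests at startup, so treat this as illustrative rather than guaranteed:

```python
# Sketch of the audio -> text -> response flow from the README. The speech
# synthesis stage is omitted; it needs the full LLaMA-Omni2 / CosyVoice 2 setup.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
llm = pipeline("text-generation", model="ICTNLP/LLaMA-Omni2-0.5B", trust_remote_code=True)

def audio_to_response(audio_path: str) -> str:
    transcript = asr(audio_path)["text"]            # stage 1: speech -> text
    outputs = llm(transcript, max_new_tokens=150)   # stage 2: text -> response
    return outputs[0]["generated_text"]
```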
app.py
CHANGED
@@ -10,34 +10,12 @@ import numpy as np
 import tempfile
 import soundfile as sf
 
-#
-
-native_llama_omni_available = False
-native_modules_error = None
-
-if try_native_modules:
-    try:
-        # Try importing LLaMA-Omni2 specific modules using subprocess to avoid crashing if imports fail
-        print("Checking for LLaMA-Omni2 native modules...")
-        module_check_result = subprocess.run(
-            [sys.executable, "-c", "import llama_omni2; print('LLaMA-Omni2 modules found!')"],
-            capture_output=True,
-            text=True
-        )
-        if "LLaMA-Omni2 modules found!" in module_check_result.stdout:
-            print("LLaMA-Omni2 native modules are available!")
-            native_llama_omni_available = True
-        else:
-            print(f"LLaMA-Omni2 native modules not found: {module_check_result.stderr}")
-            native_modules_error = module_check_result.stderr
-    except Exception as e:
-        print(f"Error checking for LLaMA-Omni2 native modules: {e}")
-        native_modules_error = str(e)
+# Path configuration for the models
+MODELS_DIR = os.environ.get("MODELS_DIR", "models")
 
 # --- Model Configuration ---
 whisper_model_id = "openai/whisper-tiny"
-llama_omni_model_id = "ICTNLP/LLaMA-Omni2-0.5B" #
-fallback_model_id = "gpt2" # Fallback if LLaMA-Omni2 fails to load
+llama_omni_model_id = "ICTNLP/LLaMA-Omni2-0.5B" # The specific model we want to use
 
 # --- Device Configuration ---
 if torch.cuda.is_available():
@@ -70,72 +48,52 @@
 # --- Load Text Generation Model ---
 text_gen_pipeline_instance = None
 text_generation_model_id = None # Will be set to the model that successfully loads
-llama_omni_native_module = None # Will hold the native LLaMA-Omni2 module if loaded
 
-        # Load the model
-        llama_omni_native_module = LLamaOmniModel.from_pretrained(llama_omni_model_id)
-        text_generation_model_id = llama_omni_model_id
-        print(f"LLaMA-Omni2 native module loaded successfully: {type(llama_omni_native_module)}")
-    except Exception as e:
-        print(f"Error loading native LLaMA-Omni2 module: {e}")
-        llama_omni_native_module = None
-
-    print(f"Loading fallback text generation model: {fallback_model_id}...")
-    text_gen_pipeline_instance = pipeline(
-        "text-generation",
-        model=fallback_model_id,
-        torch_dtype=dtype_for_pipelines,
-        device=device_for_pipelines
-    )
-    text_generation_model_id = fallback_model_id
-    print(f"Fallback model ({fallback_model_id}) loaded successfully.")
-except Exception as e:
-    print(f"Error loading fallback model ({fallback_model_id}): {e}")
-    text_gen_pipeline_instance = None
+# Check whether the model has already been downloaded
+local_model_path = os.path.join(MODELS_DIR, os.path.basename(llama_omni_model_id))
+if os.path.exists(local_model_path):
+    print(f"Found local model at {local_model_path}")
+    model_path_to_use = local_model_path
+else:
+    print(f"Using model from Hugging Face Hub: {llama_omni_model_id}")
+    model_path_to_use = llama_omni_model_id
+
+try:
+    print(f"Attempting to load LLaMA-Omni2 model: {model_path_to_use}...")
+    # LLaMA models often require specific loading configurations
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_path_to_use,
+        trust_remote_code=True,
+        use_fast=False
+    )
+
+    model = AutoModelForCausalLM.from_pretrained(
+        model_path_to_use,
+        torch_dtype=dtype_for_pipelines,
+        trust_remote_code=True,
+        device_map="auto" if torch.cuda.is_available() else None,
+        low_cpu_mem_usage=True
+    )
+
+    # Check if this is a specialized Omni2 model with audio capabilities
+    is_omni2_speech_model = hasattr(model, "generate_with_speech") or hasattr(model, "generate_speech")
+
+    text_gen_pipeline_instance = pipeline(
+        "text-generation",
+        model=model,
+        tokenizer=tokenizer,
+        torch_dtype=dtype_for_pipelines,
+        device=device_for_pipelines if not torch.cuda.is_available() else None
+    )
+    text_generation_model_id = llama_omni_model_id
+    print(f"LLaMA-Omni2 model ({llama_omni_model_id}) loaded successfully.")
+    print(f"Model has speech generation capabilities: {is_omni2_speech_model}")
+
+except Exception as e:
+    print(f"Error loading LLaMA-Omni2 model: {e}")
+    print("Could not load the LLaMA-Omni2 model. Check that the model is available and that the configuration is correct.")
+    text_gen_pipeline_instance = None
+    # There is no GPT-2 fallback any more
 
 # --- Core Functions ---
 def transcribe_audio_input(audio_filepath):
@@ -155,40 +113,6 @@ def transcribe_audio_input(audio_filepath):
 
 def generate_text_response(prompt_text):
     """Generate both text and speech response if possible"""
-    # If we have a native LLaMA-Omni2 module, use it for text and speech
-    if llama_omni_native_module is not None:
-        if not prompt_text or not prompt_text.strip():
-            return "Prompt is empty. Please provide text for generation.", None
-        try:
-            print(f"Generating response with native LLaMA-Omni2 for prompt: '{prompt_text[:100]}...'")
-
-            # Using the native module's interface for text and speech generation
-            if hasattr(llama_omni_native_module, "generate_with_speech"):
-                # This method should return both text and audio
-                text_response, audio_data = llama_omni_native_module.generate_with_speech(
-                    prompt_text,
-                    max_length=150
-                )
-
-                # Save audio to a temporary file
-                if audio_data is not None:
-                    audio_path = save_audio_to_temp_file(audio_data)
-                    print(f"Generated response with audio: '{text_response}'")
-                    return text_response, audio_path
-                else:
-                    print(f"Generated text response (no audio): '{text_response}'")
-                    return text_response, None
-            else:
-                # Fallback to text-only generation
-                response = llama_omni_native_module.generate(prompt_text, max_length=150)
-                print(f"Generated text-only response: '{response}'")
-                return response, None
-
-        except Exception as e:
-            print(f"Error using native LLaMA-Omni2 generation: {e}")
-            return f"Error during native LLaMA-Omni2 text generation: {str(e)}", None
-
-    # Try transformers model with possible speech capabilities
     if not text_gen_pipeline_instance:
         return f"Text generation model not available. Check logs.", None
     if not prompt_text or not prompt_text.strip():
@@ -198,67 +122,59 @@ def generate_text_response(prompt_text):
     print(f"Generating response for prompt (first 100 chars): '{prompt_text[:100]}...'")
 
     # Try to use special speech generation if available
-            text_response = text_gen_pipeline_instance.tokenizer.decode(outputs["sequences"][0], skip_special_tokens=True)
-            audio_data = outputs.get("speech_output", None)
-        elif hasattr(model, "generate_speech"):
-            # Text generation first
-            output_ids = model.generate(
-                **inputs,
-                max_new_tokens=150,
-                do_sample=True,
-                temperature=0.7,
-                top_p=0.9
-            )
-            text_response = text_gen_pipeline_instance.tokenizer.decode(output_ids[0], skip_special_tokens=True)
-
-            # Then speech generation
-            audio_data = model.generate_speech(output_ids)
-
-                prompt_text,
-                max_new_tokens=150,
-                do_sample=True,
-                temperature=0.7,
-                top_p=0.9,
-                num_return_sequences=1
-            )
-        else:
-            # Parameters for fallback model
-            generated_outputs = text_gen_pipeline_instance(
-                prompt_text,
-                max_new_tokens=100,
-                num_return_sequences=1
-            )
+    model = text_gen_pipeline_instance.model
+
+    # Check if model has speech generation capability
+    if hasattr(model, "generate_with_speech") or hasattr(model, "generate_speech"):
+        try:
+            # Prepare inputs
+            inputs = text_gen_pipeline_instance.tokenizer(prompt_text, return_tensors="pt").to(model.device)
+
+            # Generate with speech
+            if hasattr(model, "generate_with_speech"):
+                outputs = model.generate_with_speech(
+                    **inputs,
+                    max_new_tokens=150,
+                    do_sample=True,
+                    temperature=0.7,
+                    top_p=0.9
+                )
+                text_response = text_gen_pipeline_instance.tokenizer.decode(outputs["sequences"][0], skip_special_tokens=True)
+                audio_data = outputs.get("speech_output", None)
+            elif hasattr(model, "generate_speech"):
+                # Text generation first
+                output_ids = model.generate(
+                    **inputs,
+                    max_new_tokens=150,
+                    do_sample=True,
+                    temperature=0.7,
+                    top_p=0.9
+                )
+                text_response = text_gen_pipeline_instance.tokenizer.decode(output_ids[0], skip_special_tokens=True)
+
+                # Then speech generation
+                audio_data = model.generate_speech(output_ids)
+
+            # Save audio if we got it
+            if audio_data is not None:
+                audio_path = save_audio_to_temp_file(audio_data)
+                return text_response, audio_path
+            else:
+                return text_response, None
+
+        except Exception as speech_error:
+            print(f"Error generating speech with LLaMA-Omni2: {speech_error}")
+            print("Falling back to text-only generation")
+
+    # Parameters optimized for LLaMA-Omni2 text-only generation
+    generated_outputs = text_gen_pipeline_instance(
+        prompt_text,
+        max_new_tokens=150,
+        do_sample=True,
+        temperature=0.7,
+        top_p=0.9,
+        num_return_sequences=1
+    )
 
     response_text = generated_outputs[0]["generated_text"]
     print(f"Generated text-only response: '{response_text}'")
@@ -304,24 +220,18 @@ def combined_pipeline_process(audio_filepath):
         error_msg_for_generation = "Cannot generate response: ASR model not loaded."
         return transcribed_text, error_msg_for_generation, None
 
-    if not text_gen_pipeline_instance
+    if not text_gen_pipeline_instance:
         return transcribed_text, f"Cannot generate response: No text generation model available.", None
 
     final_response, audio_path = generate_text_response(transcribed_text)
     return transcribed_text, final_response, audio_path
 
 # Determine model status for UI
-elif text_generation_model_id == llama_omni_model_id:
-    llama_model_status = "LLaMA-Omni2 loaded via transformers"
-    using_model = "LLaMA-Omni2-0.5B (via transformers)"
-elif text_generation_model_id == fallback_model_id:
-    llama_model_status = "Failed to load - Using GPT-2 as fallback"
-    using_model = "GPT-2 (fallback model)"
+if text_generation_model_id == llama_omni_model_id:
+    llama_model_status = "LLaMA-Omni2-0.5B loaded successfully"
+    using_model = "LLaMA-Omni2-0.5B"
 else:
+    llama_model_status = "Failed to load LLaMA-Omni2 model"
     using_model = "No model available"
 
 # --- Gradio Interface Definition ---
@@ -330,23 +240,22 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Whisper + LLaMA-Omni2 Demo") as ap
     f"""
     # Speech-to-Text and Text/Speech Generation Demo
 
-    If LLaMA-Omni2 cannot be loaded, it falls back to GPT-2 (text only).
+    This application uses **OpenAI Whisper Tiny** for speech recognition and **LLaMA-Omni2-0.5B** for text and speech generation.
 
+    **Model in use:** {using_model}
 
+    Upload an audio file to have it transcribed. The transcribed text is then used as the prompt for the text/speech generation model.
     """
     )
 
+    with gr.Tab("Full Pipeline: Audio -> Transcription -> Generation"):
+        gr.Markdown("### Step 1: Upload Audio -> Step 2: Transcription -> Step 3: Text/Speech Generation")
+        input_audio_pipeline = gr.Audio(type="filepath", label="Upload your audio file (.wav, .mp3)")
+        submit_button_full = gr.Button("Run Full Pipeline", variant="primary")
+        output_transcription_pipeline = gr.Textbox(label="Transcribed Text (from Whisper)", lines=5)
+        model_label = f"Generated Text (from {using_model})"
         output_generation_pipeline = gr.Textbox(label=model_label, lines=7)
+        output_audio_pipeline = gr.Audio(label="Generated Speech (if available)", visible=True)
 
     submit_button_full.click(
         fn=combined_pipeline_process,
@@ -354,14 +263,14 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Whisper + LLaMA-Omni2 Demo") as ap
         outputs=[output_transcription_pipeline, output_generation_pipeline, output_audio_pipeline]
     )
 
+    with gr.Tab("Test Speech Recognition (Whisper Tiny)"):
+        gr.Markdown("### Transcribe audio to text using Whisper Tiny.")
+        input_audio_asr = gr.Audio(type="filepath", label="Upload Audio for Recognition")
+        submit_button_asr = gr.Button("Transcribe Audio", variant="secondary")
+        output_transcription_asr = gr.Textbox(label="Transcription Result", lines=10)
 
     def asr_only_ui(audio_file):
+        if audio_file is None: return "Please upload an audio file."
         transcription, _ = transcribe_audio_input(audio_file)
        return transcription
 
@@ -371,17 +280,17 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Whisper + LLaMA-Omni2 Demo") as ap
         outputs=[output_transcription_asr]
     )
 
+    with gr.Tab(f"Test Text/Speech Generation"):
         model_name_gen = using_model
+        gr.Markdown(f"### Generate text and speech from a prompt using {model_name_gen}.")
+        input_text_prompt_gen = gr.Textbox(label="Your Text Prompt", placeholder="Type your text here...", lines=5)
+        submit_button_gen = gr.Button("Generate Text and Speech", variant="secondary")
+        output_generation_gen = gr.Textbox(label="Generated Text Result", lines=10)
+        output_audio_gen = gr.Audio(label="Generated Speech (if available)")
 
     def text_generation_ui(prompt):
         if not prompt or not prompt.strip():
+            return "Please provide a prompt first.", None
         response_text, audio_path = generate_text_response(prompt)
         return response_text, audio_path
 
@@ -392,40 +301,18 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Whisper + LLaMA-Omni2 Demo") as ap
     )
 
     gr.Markdown("--- ")
-    gr.Markdown(f"* **Whisper Model ({whisper_model_id}):** `{asr_load_status}`")
-    gr.Markdown(f"* **LLaMA-Omni2 Model ({llama_omni_model_id}):** `{llama_model_status}`")
-
-    if native_llama_omni_available:
-        gr.Markdown("* **LLaMA-Omni2 Native Modules:** `Available`")
-    else:
-        native_error = f": {native_modules_error}" if native_modules_error else ""
-        gr.Markdown(f"* **LLaMA-Omni2 Native Modules:** `Not Available{native_error}`")
-
-    """
-    **Note about LLaMA-Omni2-0.5B:** This model has complex dependencies and requires a specific setup environment.
-    The system attempted to load it but fell back to GPT-2. For full functionality with LLaMA-Omni2, you should:
-
-    1. Clone the [LLaMA-Omni2 repository](https://github.com/ictnlp/LLaMA-Omni2)
-    2. Install the required dependencies including CosyVoice 2
-    3. Download the Whisper-large-v3 model and flow-matching model and vocoder of CosyVoice 2
-    4. Set up the controller, model worker, and web server as described in the repository
-
-    Note that LLaMA-Omni2 is designed for generating both text and speech responses simultaneously.
-    For the full experience with speech synthesis, you need the complete setup.
-    """
-    )
+    gr.Markdown("### Model Load Status (at app startup):")
+    asr_load_status = "Loaded successfully" if asr_pipeline_instance else "Failed to load (check the logs)"
+
+    gr.Markdown(f"* **Whisper Model ({whisper_model_id}):** `{asr_load_status}`")
+    gr.Markdown(f"* **LLaMA-Omni2 Model ({llama_omni_model_id}):** `{llama_model_status}`")
 
 # --- Launch the Gradio App ---
 if __name__ == "__main__":
     print("Launching Gradio demo...")
     try:
-        app_interface.launch(share=True)
+        app_interface.launch(share=True, server_name="0.0.0.0")
     except Exception as e:
         print(f"Error launching with share=True: {e}")
         print("Trying to launch without sharing...")
-        app_interface.launch()
+        app_interface.launch(server_name="0.0.0.0")
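The new loading path looks for the checkpoint under MODELS_DIR before falling back to the Hub. Below is a hedged sketch of pre-populating that directory with huggingface_hub (already listed in requirements.txt) so the local branch is taken at startup; it assumes network access when the script runs:

```python
# Pre-download the checkpoint into MODELS_DIR so that app.py's
# os.path.exists(local_model_path) check succeeds at startup.
import os
from huggingface_hub import snapshot_download

models_dir = os.environ.get("MODELS_DIR", "models")
repo_id = "ICTNLP/LLaMA-Omni2-0.5B"

# app.py resolves MODELS_DIR/<basename of repo id>, i.e. models/LLaMA-Omni2-0.5B
target = os.path.join(models_dir, os.path.basename(repo_id))
snapshot_download(repo_id=repo_id, local_dir=target)
print(f"Model files available under {target}")
```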
app.yaml
ADDED
@@ -0,0 +1,16 @@
+sdk: docker
+build_config:
+  gpu: true
+  cuda: "11.8"
+  python_version: "3.10"
+  system_packages:
+    - "ffmpeg"
+    - "libsndfile1"
+resources:
+  gpu: 1
+  cpu: 2
+  memory: "16G"
+  disk: "10G"
+models:
+  - "openai/whisper-tiny"
+  - "ICTNLP/LLaMA-Omni2-0.5B"
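app.yaml requests one GPU on CUDA 11.8 with 16 GB of memory. A small sanity-check sketch, runnable inside the container, for confirming the runtime matches what the config asks for; the PyTorch calls are standard, while the build_config/resources keys themselves are interpreted by the Space platform:

```python
# Quick runtime check against the resources requested in app.yaml.
import torch

if torch.cuda.is_available():
    print(f"GPUs visible: {torch.cuda.device_count()}")   # resources -> gpu: 1
    print(f"CUDA build: {torch.version.cuda}")            # build_config -> cuda: "11.8"
    print(f"Device 0: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU visible; inference will fall back to CPU.")
```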
requirements.txt
CHANGED
@@ -1,23 +1,25 @@
-torch>=2.
+torch>=2.0.0
 torchaudio>=2.1.0
 torchvision>=0.16.0
 # packaging # often a dep of others
 # ninja # often a dep of others
 uvicorn
-gradio>=
+gradio>=5.29.0
 einops
-transformers>=4.
-accelerate
+transformers>=4.28.1,<5.0.0
+accelerate>=0.33.0
 bitsandbytes # If LLaMA-Omni2 makes use of it for 4/8bit loading
-sentencepiece
+sentencepiece>=0.1.99
 protobuf
-openai-whisper
+openai-whisper>=20230918
 shortuuid
 pydub
 ffmpeg-python
 huggingface_hub # For downloading models from HF Hub
-soundfile
-safetensors
+soundfile>=0.13.0
+safetensors>=0.3.1
 ai2-olmo # In case LLaMA-Omni2 uses olmo under the hood for the LLM part
 
-# fairseq and flash-attn are removed, expected to be handled by LLaMA-Omni2's setup via `pip install -e .` in Dockerfile
+# fairseq and flash-attn are removed, expected to be handled by LLaMA-Omni2's setup via `pip install -e .` in Dockerfile
+
+numpy>=1.21.0