marcosremar2 committed on
Commit fbce578 · 1 Parent(s): 538da63

Configure LLaMA-Omni2 0.5B without the GPT-2 fallback and prepare for deployment on Hugging Face
Files changed (5):

  1. Dockerfile        +30  -0
  2. README.md         +69  -1
  3. app.py            +132 -245
  4. app.yaml          +16  -0
  5. requirements.txt  +11  -9
Dockerfile ADDED
@@ -0,0 +1,30 @@
+ FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
+
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     git \
+     wget \
+     ffmpeg \
+     libsndfile1 \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy the application code
+ COPY . .
+
+ # Prepare the directory for models
+ RUN mkdir -p models
+
+ # Install the Python requirements
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Expose the port for Gradio
+ EXPOSE 7860
+
+ # Set environment variables
+ ENV PYTHONUNBUFFERED=1
+ ENV MODELS_DIR=/app/models
+
+ # Command to start the server
+ CMD ["python", "app.py"]
README.md CHANGED
@@ -12,4 +12,72 @@ pinned: false
  # Ex: hardware: nvidia-t4
  ---
 
- This is a Hugging Face Space to demonstrate the LLaMA-Omni model.
+ # LLaMA-Omni2 + Whisper Demo
+
+ A demo application that combines Whisper speech recognition with text and speech generation from LLaMA-Omni2 0.5B.
+
+ ## About the Project
+
+ This application demonstrates the ability of the LLaMA-Omni2 0.5B model to process spoken instructions and generate both text and speech responses, all at low latency. The modular architecture is based on research from the Institute of Computing Technology, Chinese Academy of Sciences.
+
+ ## Key Features
+
+ - 🎤 **Speech Recognition**: OpenAI Whisper-tiny for audio transcription
+ - 💬 **Text Generation**: The LLaMA-Omni2 model for generating text responses
+ - 🔊 **Speech Synthesis**: Speech generated from the text responses (when available)
+ - 🔄 **Full Pipeline**: An integrated audio → text → response → speech flow
+
+ ## How to Use
+
+ The Gradio interface offers three interaction modes:
+
+ 1. **Full Pipeline**: Upload an audio file; it is transcribed and used to generate a text/speech response
+ 2. **Speech Recognition**: Test only Whisper's transcription capability
+ 3. **Text/Speech Generation**: Provide your own text and generate a response
+
+ ## LLaMA-Omni2 Architecture
+
+ LLaMA-Omni2 is a speech-language model made up of four main components:
+
+ 1. **Speech Encoder**: Based on Whisper-large-v3, converts speech input into acoustic representations
+ 2. **Speech Adapter**: Bridges the acoustic and textual spaces
+ 3. **LLM Core**: The "reasoning engine", based on Qwen2.5-Instruct
+ 4. **Streaming TTS Decoder**: Converts text tokens into speech continuously
+
+ ## Running Locally
+
+ If you want to run this application locally:
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/seu-usuario/llama-omni-demo
+ cd llama-omni-demo
+
+ # Install the dependencies
+ pip install -r requirements.txt
+
+ # Run the application
+ python app.py
+ ```
+
+ ## Requirements
+
+ - Python 3.10+
+ - A CUDA-compatible GPU, or a CPU with at least 8 GB of RAM
+ - The dependencies listed in requirements.txt
+
+ ## Current Limitations
+
+ - LLaMA-Omni2 is an experimental model and may generate incorrect or inaccurate responses
+ - Speech generation may be unavailable if the model did not load correctly
+ - Significant compute resources are needed for optimal execution
+
+ ## References
+
+ - [LLaMA-Omni2 repository](https://github.com/ictnlp/LLaMA-Omni2)
+ - [OpenAI Whisper](https://github.com/openai/whisper)
+ - [LLaMA-Omni2 paper](https://arxiv.org/abs/2505.02625)
+
+ ## License
+
+ This project is licensed under the Apache License 2.0.
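
For readers who want the README's "Full Pipeline" mode (audio → text → response) in code form, here is a condensed, hedged sketch of what the new `app.py` below does with `transformers` pipelines. The file name `question.wav` is a placeholder, and the sketch assumes the LLaMA-Omni2 checkpoint loads as a plain text-generation pipeline, which the app itself does not take for granted (it wraps the load in a try/except); the sampling settings mirror the ones used in `app.py`.

```python
# Condensed sketch of the README's "Full Pipeline" mode: audio -> transcript -> reply.
# Assumes the LLaMA-Omni2 checkpoint can be served by a plain text-generation
# pipeline; "question.wav" is a placeholder input file.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

# Step 1: speech recognition with Whisper-tiny.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny", device=device)

# Step 2: text generation with LLaMA-Omni2 (trust_remote_code mirrors app.py).
generator = pipeline(
    "text-generation",
    model="ICTNLP/LLaMA-Omni2-0.5B",
    trust_remote_code=True,
    device=device,
)

transcript = asr("question.wav")["text"]
reply = generator(transcript, max_new_tokens=150, do_sample=True, temperature=0.7, top_p=0.9)
print(reply[0]["generated_text"])
```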
app.py CHANGED
@@ -10,34 +10,12 @@ import numpy as np
  import tempfile
  import soundfile as sf
 
- # Check if we can import LLaMA-Omni2's modules
- try_native_modules = True
- native_llama_omni_available = False
- native_modules_error = None
-
- if try_native_modules:
-     try:
-         # Try importing LLaMA-Omni2 specific modules using subprocess to avoid crashing if imports fail
-         print("Checking for LLaMA-Omni2 native modules...")
-         module_check_result = subprocess.run(
-             [sys.executable, "-c", "import llama_omni2; print('LLaMA-Omni2 modules found!')"],
-             capture_output=True,
-             text=True
-         )
-         if "LLaMA-Omni2 modules found!" in module_check_result.stdout:
-             print("LLaMA-Omni2 native modules are available!")
-             native_llama_omni_available = True
-         else:
-             print(f"LLaMA-Omni2 native modules not found: {module_check_result.stderr}")
-             native_modules_error = module_check_result.stderr
-     except Exception as e:
-         print(f"Error checking for LLaMA-Omni2 native modules: {e}")
-         native_modules_error = str(e)
+ # Path configuration for locally downloaded models
+ MODELS_DIR = os.environ.get("MODELS_DIR", "models")
 
  # --- Model Configuration ---
  whisper_model_id = "openai/whisper-tiny"
- llama_omni_model_id = "ICTNLP/LLaMA-Omni2-0.5B"  # Primary model we'll try to load
- fallback_model_id = "gpt2"  # Fallback if LLaMA-Omni2 fails to load
+ llama_omni_model_id = "ICTNLP/LLaMA-Omni2-0.5B"  # The specific model we want to use
 
  # --- Device Configuration ---
  if torch.cuda.is_available():
@@ -70,72 +48,52 @@ except Exception as e:
  # --- Load Text Generation Model ---
  text_gen_pipeline_instance = None
  text_generation_model_id = None  # Will be set to the model that successfully loads
- llama_omni_native_module = None  # Will hold the native LLaMA-Omni2 module if loaded
 
- # Try native LLaMA-Omni2 module first if available
- if native_llama_omni_available:
-     try:
-         print("Attempting to load LLaMA-Omni2 using native modules...")
-         # Import the required modules
-         import llama_omni2
-         from llama_omni2.model import Model as LLamaOmniModel
-
-         # Load the model
-         llama_omni_native_module = LLamaOmniModel.from_pretrained(llama_omni_model_id)
-         text_generation_model_id = llama_omni_model_id
-         print(f"LLaMA-Omni2 native module loaded successfully: {type(llama_omni_native_module)}")
-     except Exception as e:
-         print(f"Error loading native LLaMA-Omni2 module: {e}")
-         llama_omni_native_module = None
+ # Check whether the model has already been downloaded locally
+ local_model_path = os.path.join(MODELS_DIR, os.path.basename(llama_omni_model_id))
+ if os.path.exists(local_model_path):
+     print(f"Found local model at {local_model_path}")
+     model_path_to_use = local_model_path
+ else:
+     print(f"Using model from Hugging Face Hub: {llama_omni_model_id}")
+     model_path_to_use = llama_omni_model_id
 
- # If native module failed, try loading using transformers with special handling for Omni2
- if llama_omni_native_module is None and text_generation_model_id is None:
-     try:
-         print(f"Attempting to load LLaMA-Omni2 using transformers: {llama_omni_model_id}...")
-         # LLaMA models often require specific loading configurations
-         tokenizer = AutoTokenizer.from_pretrained(llama_omni_model_id, trust_remote_code=True)
-         model = AutoModelForCausalLM.from_pretrained(
-             llama_omni_model_id,
-             torch_dtype=dtype_for_pipelines,
-             trust_remote_code=True,
-             device_map="auto" if torch.cuda.is_available() else None,
-             low_cpu_mem_usage=True
-         )
-
-         # Check if this is a specialized Omni2 model with audio capabilities
-         is_omni2_speech_model = hasattr(model, "generate_with_speech") or hasattr(model, "generate_speech")
-
-         text_gen_pipeline_instance = pipeline(
-             "text-generation",
-             model=model,
-             tokenizer=tokenizer,
-             torch_dtype=dtype_for_pipelines,
-             device=device_for_pipelines if not torch.cuda.is_available() else None
-         )
-         text_generation_model_id = llama_omni_model_id
-         print(f"LLaMA-Omni2 model ({llama_omni_model_id}) loaded successfully via transformers.")
-         print(f"Model has speech generation capabilities: {is_omni2_speech_model}")
-
-     except Exception as e:
-         warnings.warn(f"Error loading LLaMA-Omni2 model: {e}\nFalling back to {fallback_model_id}")
-         print(f"Error loading LLaMA-Omni2 model via transformers: {e}")
-         print(f"Falling back to {fallback_model_id}")
-
- # Fall back to GPT-2 if LLaMA-Omni2 fails to load both ways
- if text_generation_model_id is None:
-     try:
-         print(f"Loading fallback text generation model: {fallback_model_id}...")
-         text_gen_pipeline_instance = pipeline(
-             "text-generation",
-             model=fallback_model_id,
-             torch_dtype=dtype_for_pipelines,
-             device=device_for_pipelines
-         )
-         text_generation_model_id = fallback_model_id
-         print(f"Fallback model ({fallback_model_id}) loaded successfully.")
-     except Exception as e:
-         print(f"Error loading fallback model ({fallback_model_id}): {e}")
-         text_gen_pipeline_instance = None
+ try:
+     print(f"Attempting to load LLaMA-Omni2 model: {model_path_to_use}...")
+     # LLaMA models often require specific loading configurations
+     tokenizer = AutoTokenizer.from_pretrained(
+         model_path_to_use,
+         trust_remote_code=True,
+         use_fast=False
+     )
+
+     model = AutoModelForCausalLM.from_pretrained(
+         model_path_to_use,
+         torch_dtype=dtype_for_pipelines,
+         trust_remote_code=True,
+         device_map="auto" if torch.cuda.is_available() else None,
+         low_cpu_mem_usage=True
+     )
+
+     # Check if this is a specialized Omni2 model with audio capabilities
+     is_omni2_speech_model = hasattr(model, "generate_with_speech") or hasattr(model, "generate_speech")
+
+     text_gen_pipeline_instance = pipeline(
+         "text-generation",
+         model=model,
+         tokenizer=tokenizer,
+         torch_dtype=dtype_for_pipelines,
+         device=device_for_pipelines if not torch.cuda.is_available() else None
+     )
+     text_generation_model_id = llama_omni_model_id
+     print(f"LLaMA-Omni2 model ({llama_omni_model_id}) loaded successfully.")
+     print(f"Model has speech generation capabilities: {is_omni2_speech_model}")
+
+ except Exception as e:
+     print(f"Error loading LLaMA-Omni2 model: {e}")
+     print("Could not load the LLaMA-Omni2 model. Check that the model is available and that the configuration is correct.")
+     text_gen_pipeline_instance = None
+     # There is no GPT-2 fallback anymore
 
  # --- Core Functions ---
  def transcribe_audio_input(audio_filepath):
@@ -155,40 +113,6 @@
 
  def generate_text_response(prompt_text):
      """Generate both text and speech response if possible"""
-     # If we have a native LLaMA-Omni2 module, use it for text and speech
-     if llama_omni_native_module is not None:
-         if not prompt_text or not prompt_text.strip():
-             return "Prompt is empty. Please provide text for generation.", None
-         try:
-             print(f"Generating response with native LLaMA-Omni2 for prompt: '{prompt_text[:100]}...'")
-
-             # Using the native module's interface for text and speech generation
-             if hasattr(llama_omni_native_module, "generate_with_speech"):
-                 # This method should return both text and audio
-                 text_response, audio_data = llama_omni_native_module.generate_with_speech(
-                     prompt_text,
-                     max_length=150
-                 )
-
-                 # Save audio to a temporary file
-                 if audio_data is not None:
-                     audio_path = save_audio_to_temp_file(audio_data)
-                     print(f"Generated response with audio: '{text_response}'")
-                     return text_response, audio_path
-                 else:
-                     print(f"Generated text response (no audio): '{text_response}'")
-                     return text_response, None
-             else:
-                 # Fallback to text-only generation
-                 response = llama_omni_native_module.generate(prompt_text, max_length=150)
-                 print(f"Generated text-only response: '{response}'")
-                 return response, None
-
-         except Exception as e:
-             print(f"Error using native LLaMA-Omni2 generation: {e}")
-             return f"Error during native LLaMA-Omni2 text generation: {str(e)}", None
-
-     # Try transformers model with possible speech capabilities
      if not text_gen_pipeline_instance:
          return f"Text generation model not available. Check logs.", None
      if not prompt_text or not prompt_text.strip():
@@ -198,67 +122,59 @@ def generate_text_response(prompt_text):
      print(f"Generating response for prompt (first 100 chars): '{prompt_text[:100]}...'")
 
      # Try to use special speech generation if available
-     if text_generation_model_id == llama_omni_model_id:
-         model = text_gen_pipeline_instance.model
-
-         # Check if model has speech generation capability
-         if hasattr(model, "generate_with_speech") or hasattr(model, "generate_speech"):
-             try:
-                 # Prepare inputs
-                 inputs = text_gen_pipeline_instance.tokenizer(prompt_text, return_tensors="pt").to(model.device)
-
-                 # Generate with speech
-                 if hasattr(model, "generate_with_speech"):
-                     outputs = model.generate_with_speech(
-                         **inputs,
-                         max_new_tokens=150,
-                         do_sample=True,
-                         temperature=0.7,
-                         top_p=0.9
-                     )
-                     text_response = text_gen_pipeline_instance.tokenizer.decode(outputs["sequences"][0], skip_special_tokens=True)
-                     audio_data = outputs.get("speech_output", None)
-                 elif hasattr(model, "generate_speech"):
-                     # Text generation first
-                     output_ids = model.generate(
-                         **inputs,
-                         max_new_tokens=150,
-                         do_sample=True,
-                         temperature=0.7,
-                         top_p=0.9
-                     )
-                     text_response = text_gen_pipeline_instance.tokenizer.decode(output_ids[0], skip_special_tokens=True)
-
-                     # Then speech generation
-                     audio_data = model.generate_speech(output_ids)
-
-                 # Save audio if we got it
-                 if audio_data is not None:
-                     audio_path = save_audio_to_temp_file(audio_data)
-                     return text_response, audio_path
-                 else:
-                     return text_response, None
-
-             except Exception as speech_error:
-                 print(f"Error generating speech with LLaMA-Omni2: {speech_error}")
-                 print("Falling back to text-only generation")
-
-         # Parameters optimized for LLaMA-Omni2 text-only generation
-         generated_outputs = text_gen_pipeline_instance(
-             prompt_text,
-             max_new_tokens=150,
-             do_sample=True,
-             temperature=0.7,
-             top_p=0.9,
-             num_return_sequences=1
-         )
-     else:
-         # Parameters for fallback model
-         generated_outputs = text_gen_pipeline_instance(
-             prompt_text,
-             max_new_tokens=100,
-             num_return_sequences=1
-         )
+     model = text_gen_pipeline_instance.model
+
+     # Check if model has speech generation capability
+     if hasattr(model, "generate_with_speech") or hasattr(model, "generate_speech"):
+         try:
+             # Prepare inputs
+             inputs = text_gen_pipeline_instance.tokenizer(prompt_text, return_tensors="pt").to(model.device)
+
+             # Generate with speech
+             if hasattr(model, "generate_with_speech"):
+                 outputs = model.generate_with_speech(
+                     **inputs,
+                     max_new_tokens=150,
+                     do_sample=True,
+                     temperature=0.7,
+                     top_p=0.9
+                 )
+                 text_response = text_gen_pipeline_instance.tokenizer.decode(outputs["sequences"][0], skip_special_tokens=True)
+                 audio_data = outputs.get("speech_output", None)
+             elif hasattr(model, "generate_speech"):
+                 # Text generation first
+                 output_ids = model.generate(
+                     **inputs,
+                     max_new_tokens=150,
+                     do_sample=True,
+                     temperature=0.7,
+                     top_p=0.9
+                 )
+                 text_response = text_gen_pipeline_instance.tokenizer.decode(output_ids[0], skip_special_tokens=True)
+
+                 # Then speech generation
+                 audio_data = model.generate_speech(output_ids)
+
+             # Save audio if we got it
+             if audio_data is not None:
+                 audio_path = save_audio_to_temp_file(audio_data)
+                 return text_response, audio_path
+             else:
+                 return text_response, None
+
+         except Exception as speech_error:
+             print(f"Error generating speech with LLaMA-Omni2: {speech_error}")
+             print("Falling back to text-only generation")
+
+     # Parameters optimized for LLaMA-Omni2 text-only generation
+     generated_outputs = text_gen_pipeline_instance(
+         prompt_text,
+         max_new_tokens=150,
+         do_sample=True,
+         temperature=0.7,
+         top_p=0.9,
+         num_return_sequences=1
+     )
 
      response_text = generated_outputs[0]["generated_text"]
      print(f"Generated text-only response: '{response_text}'")
@@ -304,24 +220,18 @@ def combined_pipeline_process(audio_filepath):
          error_msg_for_generation = "Cannot generate response: ASR model not loaded."
          return transcribed_text, error_msg_for_generation, None
 
-     if not text_gen_pipeline_instance and llama_omni_native_module is None:
+     if not text_gen_pipeline_instance:
          return transcribed_text, f"Cannot generate response: No text generation model available.", None
 
      final_response, audio_path = generate_text_response(transcribed_text)
      return transcribed_text, final_response, audio_path
 
  # Determine model status for UI
- if llama_omni_native_module is not None:
-     llama_model_status = "Native LLaMA-Omni2 module loaded successfully"
-     using_model = "LLaMA-Omni2-0.5B (native modules)"
- elif text_generation_model_id == llama_omni_model_id:
-     llama_model_status = "LLaMA-Omni2 loaded via transformers"
-     using_model = "LLaMA-Omni2-0.5B (via transformers)"
- elif text_generation_model_id == fallback_model_id:
-     llama_model_status = "Failed to load - Using GPT-2 as fallback"
-     using_model = "GPT-2 (fallback model)"
+ if text_generation_model_id == llama_omni_model_id:
+     llama_model_status = "LLaMA-Omni2-0.5B loaded successfully"
+     using_model = "LLaMA-Omni2-0.5B"
  else:
-     llama_model_status = "Failed to load any text generation model"
+     llama_model_status = "Failed to load LLaMA-Omni2 model"
      using_model = "No model available"
 
  # --- Gradio Interface Definition ---
@@ -330,23 +240,22 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Whisper + LLaMA-Omni2 Demo") as ap
          f"""
          # Speech-to-Text and Text/Speech Generation Demo
 
-         This application uses **OpenAI Whisper Tiny** for speech recognition and attempts to use **LLaMA-Omni2-0.5B** for text and speech generation.
-         If LLaMA-Omni2 cannot be loaded, it falls back to GPT-2 (text only).
+         Esta aplicação usa **OpenAI Whisper Tiny** para reconhecimento de fala e **LLaMA-Omni2-0.5B** para geração de texto e fala.
 
-         **Currently using:** {using_model}
+         **Modelo em uso:** {using_model}
 
-         Upload an audio file to transcribe it. The transcribed text will then be used as a prompt for the text/speech generation model.
+         Envie um arquivo de áudio para transcrevê-lo. O texto transcrito será então usado como prompt para o modelo de geração de texto/fala.
          """
      )
 
-     with gr.Tab("Full Pipeline: Audio -> Transcription -> Generation"):
-         gr.Markdown("### Step 1: Upload Audio -> Step 2: Transcribe -> Step 3: Generate Text/Speech")
-         input_audio_pipeline = gr.Audio(type="filepath", label="Upload Your Audio File (.wav, .mp3)")
-         submit_button_full = gr.Button("Run Full Process", variant="primary")
-         output_transcription_pipeline = gr.Textbox(label="Transcribed Text (from Whisper)", lines=5)
-         model_label = f"Generated Text (from {using_model})"
+     with gr.Tab("Pipeline Completo: Áudio -> Transcrição -> Geração"):
+         gr.Markdown("### Etapa 1: Envie Áudio -> Etapa 2: Transcrição -> Etapa 3: Geração de Texto/Fala")
+         input_audio_pipeline = gr.Audio(type="filepath", label="Envie seu arquivo de áudio (.wav, .mp3)")
+         submit_button_full = gr.Button("Executar Processo Completo", variant="primary")
+         output_transcription_pipeline = gr.Textbox(label="Texto Transcrito (do Whisper)", lines=5)
+         model_label = f"Texto Gerado (do {using_model})"
          output_generation_pipeline = gr.Textbox(label=model_label, lines=7)
-         output_audio_pipeline = gr.Audio(label="Generated Speech (if available)", visible=True)
+         output_audio_pipeline = gr.Audio(label="Fala Gerada (se disponível)", visible=True)
 
          submit_button_full.click(
              fn=combined_pipeline_process,
@@ -354,14 +263,14 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Whisper + LLaMA-Omni2 Demo") as ap
              outputs=[output_transcription_pipeline, output_generation_pipeline, output_audio_pipeline]
          )
 
-     with gr.Tab("Test Speech-to-Text (Whisper Tiny)"):
-         gr.Markdown("### Transcribe audio to text using Whisper Tiny.")
-         input_audio_asr = gr.Audio(type="filepath", label="Upload Audio for ASR")
-         submit_button_asr = gr.Button("Transcribe Audio", variant="secondary")
-         output_transcription_asr = gr.Textbox(label="Transcription Result", lines=10)
+     with gr.Tab("Testar Reconhecimento de Fala (Whisper Tiny)"):
+         gr.Markdown("### Transcreva áudio para texto usando Whisper Tiny.")
+         input_audio_asr = gr.Audio(type="filepath", label="Envie Áudio para Reconhecimento")
+         submit_button_asr = gr.Button("Transcrever Áudio", variant="secondary")
+         output_transcription_asr = gr.Textbox(label="Resultado da Transcrição", lines=10)
 
          def asr_only_ui(audio_file):
-             if audio_file is None: return "Please upload an audio file."
+             if audio_file is None: return "Por favor, envie um arquivo de áudio."
              transcription, _ = transcribe_audio_input(audio_file)
              return transcription
 
@@ -371,17 +280,17 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Whisper + LLaMA-Omni2 Demo") as ap
              outputs=[output_transcription_asr]
          )
 
-     with gr.Tab(f"Test Text/Speech Generation"):
+     with gr.Tab(f"Testar Geração de Texto/Fala"):
          model_name_gen = using_model
-         gr.Markdown(f"### Generate text and speech from a prompt using {model_name_gen}.")
-         input_text_prompt_gen = gr.Textbox(label="Your Text Prompt", placeholder="Enter text here...", lines=5)
-         submit_button_gen = gr.Button("Generate Text & Speech", variant="secondary")
-         output_generation_gen = gr.Textbox(label="Generated Text Result", lines=10)
-         output_audio_gen = gr.Audio(label="Generated Speech (if available)")
+         gr.Markdown(f"### Gere texto e fala a partir de um prompt usando {model_name_gen}.")
+         input_text_prompt_gen = gr.Textbox(label="Seu Prompt de Texto", placeholder="Digite seu texto aqui...", lines=5)
+         submit_button_gen = gr.Button("Gerar Texto e Fala", variant="secondary")
+         output_generation_gen = gr.Textbox(label="Resultado do Texto Gerado", lines=10)
+         output_audio_gen = gr.Audio(label="Fala Gerada (se disponível)")
 
          def text_generation_ui(prompt):
              if not prompt or not prompt.strip():
-                 return "Please provide a prompt first.", None
+                 return "Por favor, forneça um prompt primeiro.", None
             response_text, audio_path = generate_text_response(prompt)
              return response_text, audio_path
 
@@ -392,40 +301,18 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Whisper + LLaMA-Omni2 Demo") as ap
          )
 
      gr.Markdown("--- ")
-     gr.Markdown("### Model Loading Status (at application start):")
-     asr_load_status = "Successfully Loaded" if asr_pipeline_instance else "Failed to Load (check console logs)"
-
-     gr.Markdown(f"* **Whisper Model ({whisper_model_id}):** `{asr_load_status}`")
-     gr.Markdown(f"* **LLaMA-Omni2 Model ({llama_omni_model_id}):** `{llama_model_status}`")
-
-     if native_llama_omni_available:
-         gr.Markdown("* **LLaMA-Omni2 Native Modules:** `Available`")
-     else:
-         native_error = f": {native_modules_error}" if native_modules_error else ""
-         gr.Markdown(f"* **LLaMA-Omni2 Native Modules:** `Not Available{native_error}`")
+     gr.Markdown("### Status do Carregamento do Modelo (na inicialização do aplicativo):")
+     asr_load_status = "Carregado com sucesso" if asr_pipeline_instance else "Falha ao carregar (verifique os logs)"
 
-     if using_model.startswith("GPT-2"):
-         gr.Markdown(
-             """
-             **Note about LLaMA-Omni2-0.5B:** This model has complex dependencies and requires a specific setup environment.
-             The system attempted to load it but fell back to GPT-2. For full functionality with LLaMA-Omni2, you should:
-
-             1. Clone the [LLaMA-Omni2 repository](https://github.com/ictnlp/LLaMA-Omni2)
-             2. Install the required dependencies including CosyVoice 2
-             3. Download the Whisper-large-v3 model and flow-matching model and vocoder of CosyVoice 2
-             4. Set up the controller, model worker, and web server as described in the repository
-
-             Note that LLaMA-Omni2 is designed for generating both text and speech responses simultaneously.
-             For the full experience with speech synthesis, you need the complete setup.
-             """
-         )
+     gr.Markdown(f"* **Modelo Whisper ({whisper_model_id}):** `{asr_load_status}`")
+     gr.Markdown(f"* **Modelo LLaMA-Omni2 ({llama_omni_model_id}):** `{llama_model_status}`")
 
  # --- Launch the Gradio App ---
  if __name__ == "__main__":
      print("Launching Gradio demo...")
      try:
-         app_interface.launch(share=True)
+         app_interface.launch(share=True, server_name="0.0.0.0")
      except Exception as e:
          print(f"Error launching with share=True: {e}")
          print("Trying to launch without sharing...")
-         app_interface.launch()
+         app_interface.launch(server_name="0.0.0.0")
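
The new `generate_text_response` path above calls `save_audio_to_temp_file(audio_data)`, but that helper's body sits outside the hunks shown in this diff. For orientation, here is a minimal sketch of what such a helper might look like, built only on the `tempfile`, `numpy`, `soundfile`, and `torch` imports that `app.py` already has; the 16 kHz sample rate and the tensor/array handling are assumptions, not code from this commit.

```python
# Hypothetical sketch of the save_audio_to_temp_file helper referenced above.
import tempfile

import numpy as np
import soundfile as sf
import torch


def save_audio_to_temp_file(audio_data, sample_rate=16000):
    """Write model audio output to a temporary .wav file and return its path."""
    # Accept either a torch tensor or a numpy array from the model.
    if isinstance(audio_data, torch.Tensor):
        audio_data = audio_data.detach().cpu().numpy()
    audio_data = np.asarray(audio_data, dtype=np.float32).squeeze()

    # Keep the file around after closing so Gradio can serve it by path.
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    sf.write(tmp.name, audio_data, sample_rate)
    return tmp.name
```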
app.yaml ADDED
@@ -0,0 +1,16 @@
+ sdk: docker
+ build_config:
+   gpu: true
+   cuda: "11.8"
+   python_version: "3.10"
+   system_packages:
+     - "ffmpeg"
+     - "libsndfile1"
+ resources:
+   gpu: 1
+   cpu: 2
+   memory: "16G"
+   disk: "10G"
+ models:
+   - "openai/whisper-tiny"
+   - "ICTNLP/LLaMA-Omni2-0.5B"
requirements.txt CHANGED
@@ -1,23 +1,25 @@
- torch>=2.1.0
+ torch>=2.0.0
  torchaudio>=2.1.0
  torchvision>=0.16.0
  # packaging # often a dep of others
  # ninja # often a dep of others
  uvicorn
- gradio>=3.50.2 # Keep Gradio, LLaMA-Omni2 uses it. Update if a newer version is needed.
+ gradio>=5.29.0
  einops
- transformers>=4.36.2 # Or a version compatible with LLaMA-Omni2 and Whisper
- accelerate
+ transformers>=4.28.1,<5.0.0
+ accelerate>=0.33.0
  bitsandbytes # If LLaMA-Omni2 makes use of it for 4/8bit loading
- sentencepiece
+ sentencepiece>=0.1.99
  protobuf
- openai-whisper
+ openai-whisper>=20230918
  shortuuid
  pydub
  ffmpeg-python
  huggingface_hub # For downloading models from HF Hub
- soundfile # To handle audio files if not using gr.Audio input directly for some reason
- safetensors
+ soundfile>=0.13.0
+ safetensors>=0.3.1
  ai2-olmo # In case LLaMA-Omni2 uses olmo under the hood for the LLM part
 
- # fairseq and flash-attn are removed, expected to be handled by LLaMA-Omni2's setup via `pip install -e .` in Dockerfile
+ # fairseq and flash-attn are removed, expected to be handled by LLaMA-Omni2's setup via `pip install -e .` in Dockerfile
+
+ numpy>=1.21.0