Spaces: Build error
Commit · fbce578
Parent(s): 538da63

Configure LLaMA-Omni2 0.5B without a GPT-2 fallback and prepare for deployment on Hugging Face

Browse files:
- Dockerfile +30 -0
- README.md +69 -1
- app.py +132 -245
- app.yaml +16 -0
- requirements.txt +11 -9
Dockerfile
ADDED
@@ -0,0 +1,30 @@
+FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
+
+WORKDIR /app
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    git \
+    wget \
+    ffmpeg \
+    libsndfile1 \
+    && rm -rf /var/lib/apt/lists/*
+
+# Copy the source files
+COPY . .
+
+# Prepare the models directory
+RUN mkdir -p models
+
+# Install Python requirements
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Expose the Gradio port
+EXPOSE 7860
+
+# Set environment variables
+ENV PYTHONUNBUFFERED=1
+ENV MODELS_DIR=/app/models
+
+# Command to start the server
+CMD ["python", "app.py"]
README.md
CHANGED
@@ -12,4 +12,72 @@ pinned: false
 # Ex: hardware: nvidia-t4
 ---
 
-
+# LLaMA-Omni2 + Whisper Demo
+
+A demo application that combines Whisper speech recognition with LLaMA-Omni2 0.5B text and speech generation.
+
+## About the Project
+
+This application demonstrates the LLaMA-Omni2 0.5B model's ability to process spoken instructions and generate responses as both text and speech, all at low latency. The modular architecture is based on research from the Institute of Computing Technology at the Chinese Academy of Sciences.
+
+## Key Features
+
+- 🎤 **Speech Recognition**: OpenAI Whisper-tiny for audio transcription
+- 💬 **Text Generation**: The LLaMA-Omni2 model for generating text responses
+- 🔊 **Speech Synthesis**: Speech generated from the text responses (when available)
+- 🔄 **Full Pipeline**: An integrated audio → text → response → speech flow
+
+## How to Use
+
+The Gradio interface offers three interaction modes:
+
+1. **Full Pipeline**: Upload an audio file; it is transcribed and used to generate a text/speech response
+2. **Speech Recognition**: Test only Whisper's transcription capability
+3. **Text/Speech Generation**: Provide your own text and generate a response
+
+## LLaMA-Omni2 Architecture
+
+LLaMA-Omni2 is a speech-language model built from four main components:
+
+1. **Speech Encoder**: Based on Whisper-large-v3; converts speech input into acoustic representations
+2. **Speech Adapter**: Bridges the acoustic and textual spaces
+3. **LLM Core**: The "reasoning engine", based on Qwen2.5-Instruct
+4. **Streaming TTS Decoder**: Converts text tokens into speech as they stream
+
+## Local Setup
+
+If you want to run this application locally:
+
+```bash
+# Clone the repository
+git clone https://github.com/seu-usuario/llama-omni-demo
+cd llama-omni-demo
+
+# Install the dependencies
+pip install -r requirements.txt
+
+# Run the application
+python app.py
+```
+
+## Requirements
+
+- Python 3.10+
+- A CUDA-compatible GPU, or a CPU with at least 8 GB of RAM
+- The dependencies listed in requirements.txt
+
+## Current Limitations
+
+- LLaMA-Omni2 is an experimental model and may generate incorrect or inaccurate responses
+- Speech generation may be unavailable if the model did not load correctly
+- Significant computational resources are required for it to run well
+
+## References
+
+- [LLaMA-Omni2 repository](https://github.com/ictnlp/LLaMA-Omni2)
+- [OpenAI Whisper](https://github.com/openai/whisper)
+- [LLaMA-Omni2 paper](https://arxiv.org/abs/2505.02625)
+
+## License
+
+This project is licensed under the Apache License 2.0.
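The pipeline the README describes (audio → text → response, with speech synthesis as a third stage) can be condensed to a few lines. Below is a minimal sketch using the same Hugging Face `pipeline` API that app.py builds on; whether the ICTNLP/LLaMA-Omni2-0.5B checkpoint actually loads through the plain text-generation pipeline is exactly what app.py tests at startup, so treat this as illustrative rather than guaranteed:

```python
# Sketch of the audio -> text -> response flow from the README. The speech
# synthesis stage is omitted; it needs the full LLaMA-Omni2 / CosyVoice 2 setup.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
llm = pipeline("text-generation", model="ICTNLP/LLaMA-Omni2-0.5B", trust_remote_code=True)

def audio_to_response(audio_path: str) -> str:
    transcript = asr(audio_path)["text"]            # stage 1: speech -> text
    outputs = llm(transcript, max_new_tokens=150)   # stage 2: text -> response
    return outputs[0]["generated_text"]
```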
app.py
CHANGED
@@ -10,34 +10,12 @@ import numpy as np
 import tempfile
 import soundfile as sf
 
-#
-
-native_llama_omni_available = False
-native_modules_error = None
-
-if try_native_modules:
-    try:
-        # Try importing LLaMA-Omni2 specific modules using subprocess to avoid crashing if imports fail
-        print("Checking for LLaMA-Omni2 native modules...")
-        module_check_result = subprocess.run(
-            [sys.executable, "-c", "import llama_omni2; print('LLaMA-Omni2 modules found!')"],
-            capture_output=True,
-            text=True
-        )
-        if "LLaMA-Omni2 modules found!" in module_check_result.stdout:
-            print("LLaMA-Omni2 native modules are available!")
-            native_llama_omni_available = True
-        else:
-            print(f"LLaMA-Omni2 native modules not found: {module_check_result.stderr}")
-            native_modules_error = module_check_result.stderr
-    except Exception as e:
-        print(f"Error checking for LLaMA-Omni2 native modules: {e}")
-        native_modules_error = str(e)
+# Path configuration for the models
+MODELS_DIR = os.environ.get("MODELS_DIR", "models")
 
 # --- Model Configuration ---
 whisper_model_id = "openai/whisper-tiny"
-llama_omni_model_id = "ICTNLP/LLaMA-Omni2-0.5B" #
-fallback_model_id = "gpt2" # Fallback if LLaMA-Omni2 fails to load
+llama_omni_model_id = "ICTNLP/LLaMA-Omni2-0.5B" # The specific model we want to use
 
 # --- Device Configuration ---
 if torch.cuda.is_available():
@@ -70,72 +48,52 @@
 # --- Load Text Generation Model ---
 text_gen_pipeline_instance = None
 text_generation_model_id = None # Will be set to the model that successfully loads
-llama_omni_native_module = None # Will hold the native LLaMA-Omni2 module if loaded
 
-        # Load the model
-        llama_omni_native_module = LLamaOmniModel.from_pretrained(llama_omni_model_id)
-        text_generation_model_id = llama_omni_model_id
-        print(f"LLaMA-Omni2 native module loaded successfully: {type(llama_omni_native_module)}")
-    except Exception as e:
-        print(f"Error loading native LLaMA-Omni2 module: {e}")
-        llama_omni_native_module = None
-
-    print(f"Loading fallback text generation model: {fallback_model_id}...")
-    text_gen_pipeline_instance = pipeline(
-        "text-generation",
-        model=fallback_model_id,
-        torch_dtype=dtype_for_pipelines,
-        device=device_for_pipelines
-    )
-    text_generation_model_id = fallback_model_id
-    print(f"Fallback model ({fallback_model_id}) loaded successfully.")
-except Exception as e:
-    print(f"Error loading fallback model ({fallback_model_id}): {e}")
-    text_gen_pipeline_instance = None
+# Check whether the model has already been downloaded
+local_model_path = os.path.join(MODELS_DIR, os.path.basename(llama_omni_model_id))
+if os.path.exists(local_model_path):
+    print(f"Found local model at {local_model_path}")
+    model_path_to_use = local_model_path
+else:
+    print(f"Using model from Hugging Face Hub: {llama_omni_model_id}")
+    model_path_to_use = llama_omni_model_id
+
+try:
+    print(f"Attempting to load LLaMA-Omni2 model: {model_path_to_use}...")
+    # LLaMA models often require specific loading configurations
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_path_to_use,
+        trust_remote_code=True,
+        use_fast=False
+    )
+
+    model = AutoModelForCausalLM.from_pretrained(
+        model_path_to_use,
+        torch_dtype=dtype_for_pipelines,
+        trust_remote_code=True,
+        device_map="auto" if torch.cuda.is_available() else None,
+        low_cpu_mem_usage=True
+    )
+
+    # Check if this is a specialized Omni2 model with audio capabilities
+    is_omni2_speech_model = hasattr(model, "generate_with_speech") or hasattr(model, "generate_speech")
+
+    text_gen_pipeline_instance = pipeline(
+        "text-generation",
+        model=model,
+        tokenizer=tokenizer,
+        torch_dtype=dtype_for_pipelines,
+        device=device_for_pipelines if not torch.cuda.is_available() else None
+    )
+    text_generation_model_id = llama_omni_model_id
+    print(f"LLaMA-Omni2 model ({llama_omni_model_id}) loaded successfully.")
+    print(f"Model has speech generation capabilities: {is_omni2_speech_model}")
+
+except Exception as e:
+    print(f"Error loading LLaMA-Omni2 model: {e}")
+    print("Could not load the LLaMA-Omni2 model. Check that the model is available and that the configuration is correct.")
+    text_gen_pipeline_instance = None
+    # There is no GPT-2 fallback any more
 
 # --- Core Functions ---
 def transcribe_audio_input(audio_filepath):
@@ -155,40 +113,6 @@ def transcribe_audio_input(audio_filepath):
 
 def generate_text_response(prompt_text):
     """Generate both text and speech response if possible"""
-    # If we have a native LLaMA-Omni2 module, use it for text and speech
-    if llama_omni_native_module is not None:
-        if not prompt_text or not prompt_text.strip():
-            return "Prompt is empty. Please provide text for generation.", None
-        try:
-            print(f"Generating response with native LLaMA-Omni2 for prompt: '{prompt_text[:100]}...'")
-
-            # Using the native module's interface for text and speech generation
-            if hasattr(llama_omni_native_module, "generate_with_speech"):
-                # This method should return both text and audio
-                text_response, audio_data = llama_omni_native_module.generate_with_speech(
-                    prompt_text,
-                    max_length=150
-                )
-
-                # Save audio to a temporary file
-                if audio_data is not None:
-                    audio_path = save_audio_to_temp_file(audio_data)
-                    print(f"Generated response with audio: '{text_response}'")
-                    return text_response, audio_path
-                else:
-                    print(f"Generated text response (no audio): '{text_response}'")
-                    return text_response, None
-            else:
-                # Fallback to text-only generation
-                response = llama_omni_native_module.generate(prompt_text, max_length=150)
-                print(f"Generated text-only response: '{response}'")
-                return response, None
-
-        except Exception as e:
-            print(f"Error using native LLaMA-Omni2 generation: {e}")
-            return f"Error during native LLaMA-Omni2 text generation: {str(e)}", None
-
-    # Try transformers model with possible speech capabilities
     if not text_gen_pipeline_instance:
         return f"Text generation model not available. Check logs.", None
     if not prompt_text or not prompt_text.strip():
@@ -198,67 +122,59 @@ def generate_text_response(prompt_text):
     print(f"Generating response for prompt (first 100 chars): '{prompt_text[:100]}...'")
 
     # Try to use special speech generation if available
-            text_response = text_gen_pipeline_instance.tokenizer.decode(outputs["sequences"][0], skip_special_tokens=True)
-            audio_data = outputs.get("speech_output", None)
-        elif hasattr(model, "generate_speech"):
-            # Text generation first
-            output_ids = model.generate(
-                **inputs,
-                max_new_tokens=150,
-                do_sample=True,
-                temperature=0.7,
-                top_p=0.9
-            )
-            text_response = text_gen_pipeline_instance.tokenizer.decode(output_ids[0], skip_special_tokens=True)
-
-            # Then speech generation
-            audio_data = model.generate_speech(output_ids)
-
-                prompt_text,
-                max_new_tokens=150,
-                do_sample=True,
-                temperature=0.7,
-                top_p=0.9,
-                num_return_sequences=1
-            )
-        else:
-            # Parameters for fallback model
-            generated_outputs = text_gen_pipeline_instance(
-                prompt_text,
-                max_new_tokens=100,
-                num_return_sequences=1
-            )
+    model = text_gen_pipeline_instance.model
+
+    # Check if model has speech generation capability
+    if hasattr(model, "generate_with_speech") or hasattr(model, "generate_speech"):
+        try:
+            # Prepare inputs
+            inputs = text_gen_pipeline_instance.tokenizer(prompt_text, return_tensors="pt").to(model.device)
+
+            # Generate with speech
+            if hasattr(model, "generate_with_speech"):
+                outputs = model.generate_with_speech(
+                    **inputs,
+                    max_new_tokens=150,
+                    do_sample=True,
+                    temperature=0.7,
+                    top_p=0.9
+                )
+                text_response = text_gen_pipeline_instance.tokenizer.decode(outputs["sequences"][0], skip_special_tokens=True)
+                audio_data = outputs.get("speech_output", None)
+            elif hasattr(model, "generate_speech"):
+                # Text generation first
+                output_ids = model.generate(
+                    **inputs,
+                    max_new_tokens=150,
+                    do_sample=True,
+                    temperature=0.7,
+                    top_p=0.9
+                )
+                text_response = text_gen_pipeline_instance.tokenizer.decode(output_ids[0], skip_special_tokens=True)
+
+                # Then speech generation
+                audio_data = model.generate_speech(output_ids)
+
+            # Save audio if we got it
+            if audio_data is not None:
+                audio_path = save_audio_to_temp_file(audio_data)
+                return text_response, audio_path
+            else:
+                return text_response, None
+
+        except Exception as speech_error:
+            print(f"Error generating speech with LLaMA-Omni2: {speech_error}")
+            print("Falling back to text-only generation")
+
+    # Parameters optimized for LLaMA-Omni2 text-only generation
+    generated_outputs = text_gen_pipeline_instance(
+        prompt_text,
+        max_new_tokens=150,
+        do_sample=True,
+        temperature=0.7,
+        top_p=0.9,
+        num_return_sequences=1
+    )
 
     response_text = generated_outputs[0]["generated_text"]
     print(f"Generated text-only response: '{response_text}'")
@@ -304,24 +220,18 @@ def combined_pipeline_process(audio_filepath):
         error_msg_for_generation = "Cannot generate response: ASR model not loaded."
         return transcribed_text, error_msg_for_generation, None
 
-    if not text_gen_pipeline_instance
+    if not text_gen_pipeline_instance:
         return transcribed_text, f"Cannot generate response: No text generation model available.", None
 
     final_response, audio_path = generate_text_response(transcribed_text)
     return transcribed_text, final_response, audio_path
 
 # Determine model status for UI
-elif text_generation_model_id == llama_omni_model_id:
-    llama_model_status = "LLaMA-Omni2 loaded via transformers"
-    using_model = "LLaMA-Omni2-0.5B (via transformers)"
-elif text_generation_model_id == fallback_model_id:
-    llama_model_status = "Failed to load - Using GPT-2 as fallback"
-    using_model = "GPT-2 (fallback model)"
+if text_generation_model_id == llama_omni_model_id:
+    llama_model_status = "LLaMA-Omni2-0.5B loaded successfully"
+    using_model = "LLaMA-Omni2-0.5B"
 else:
+    llama_model_status = "Failed to load LLaMA-Omni2 model"
     using_model = "No model available"
 
 # --- Gradio Interface Definition ---
@@ -330,23 +240,22 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Whisper + LLaMA-Omni2 Demo") as ap
     f"""
     # Speech-to-Text and Text/Speech Generation Demo
 
-    If LLaMA-Omni2 cannot be loaded, it falls back to GPT-2 (text only).
+    This application uses **OpenAI Whisper Tiny** for speech recognition and **LLaMA-Omni2-0.5B** for text and speech generation.
 
+    **Model in use:** {using_model}
 
+    Upload an audio file to have it transcribed. The transcribed text is then used as the prompt for the text/speech generation model.
     """
     )
 
+    with gr.Tab("Full Pipeline: Audio -> Transcription -> Generation"):
+        gr.Markdown("### Step 1: Upload Audio -> Step 2: Transcription -> Step 3: Text/Speech Generation")
+        input_audio_pipeline = gr.Audio(type="filepath", label="Upload your audio file (.wav, .mp3)")
+        submit_button_full = gr.Button("Run Full Pipeline", variant="primary")
+        output_transcription_pipeline = gr.Textbox(label="Transcribed Text (from Whisper)", lines=5)
+        model_label = f"Generated Text (from {using_model})"
         output_generation_pipeline = gr.Textbox(label=model_label, lines=7)
+        output_audio_pipeline = gr.Audio(label="Generated Speech (if available)", visible=True)
 
     submit_button_full.click(
         fn=combined_pipeline_process,
@@ -354,14 +263,14 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Whisper + LLaMA-Omni2 Demo") as ap
         outputs=[output_transcription_pipeline, output_generation_pipeline, output_audio_pipeline]
     )
 
+    with gr.Tab("Test Speech Recognition (Whisper Tiny)"):
+        gr.Markdown("### Transcribe audio to text using Whisper Tiny.")
+        input_audio_asr = gr.Audio(type="filepath", label="Upload Audio for Recognition")
+        submit_button_asr = gr.Button("Transcribe Audio", variant="secondary")
+        output_transcription_asr = gr.Textbox(label="Transcription Result", lines=10)
 
     def asr_only_ui(audio_file):
+        if audio_file is None: return "Please upload an audio file."
         transcription, _ = transcribe_audio_input(audio_file)
        return transcription
 
@@ -371,17 +280,17 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Whisper + LLaMA-Omni2 Demo") as ap
         outputs=[output_transcription_asr]
     )
 
+    with gr.Tab(f"Test Text/Speech Generation"):
         model_name_gen = using_model
+        gr.Markdown(f"### Generate text and speech from a prompt using {model_name_gen}.")
+        input_text_prompt_gen = gr.Textbox(label="Your Text Prompt", placeholder="Type your text here...", lines=5)
+        submit_button_gen = gr.Button("Generate Text and Speech", variant="secondary")
+        output_generation_gen = gr.Textbox(label="Generated Text Result", lines=10)
+        output_audio_gen = gr.Audio(label="Generated Speech (if available)")
 
     def text_generation_ui(prompt):
         if not prompt or not prompt.strip():
+            return "Please provide a prompt first.", None
         response_text, audio_path = generate_text_response(prompt)
         return response_text, audio_path
 
@@ -392,40 +301,18 @@ with gr.Blocks(theme=gr.themes.Soft(), title="Whisper + LLaMA-Omni2 Demo") as ap
     )
 
     gr.Markdown("--- ")
-    gr.Markdown(f"* **Whisper Model ({whisper_model_id}):** `{asr_load_status}`")
-    gr.Markdown(f"* **LLaMA-Omni2 Model ({llama_omni_model_id}):** `{llama_model_status}`")
-
-    if native_llama_omni_available:
-        gr.Markdown("* **LLaMA-Omni2 Native Modules:** `Available`")
-    else:
-        native_error = f": {native_modules_error}" if native_modules_error else ""
-        gr.Markdown(f"* **LLaMA-Omni2 Native Modules:** `Not Available{native_error}`")
-
-    """
-    **Note about LLaMA-Omni2-0.5B:** This model has complex dependencies and requires a specific setup environment.
-    The system attempted to load it but fell back to GPT-2. For full functionality with LLaMA-Omni2, you should:
-
-    1. Clone the [LLaMA-Omni2 repository](https://github.com/ictnlp/LLaMA-Omni2)
-    2. Install the required dependencies including CosyVoice 2
-    3. Download the Whisper-large-v3 model and flow-matching model and vocoder of CosyVoice 2
-    4. Set up the controller, model worker, and web server as described in the repository
-
-    Note that LLaMA-Omni2 is designed for generating both text and speech responses simultaneously.
-    For the full experience with speech synthesis, you need the complete setup.
-    """
-    )
+    gr.Markdown("### Model Load Status (at app startup):")
+    asr_load_status = "Loaded successfully" if asr_pipeline_instance else "Failed to load (check the logs)"
+
+    gr.Markdown(f"* **Whisper Model ({whisper_model_id}):** `{asr_load_status}`")
+    gr.Markdown(f"* **LLaMA-Omni2 Model ({llama_omni_model_id}):** `{llama_model_status}`")
 
 # --- Launch the Gradio App ---
 if __name__ == "__main__":
     print("Launching Gradio demo...")
     try:
-        app_interface.launch(share=True)
+        app_interface.launch(share=True, server_name="0.0.0.0")
     except Exception as e:
         print(f"Error launching with share=True: {e}")
         print("Trying to launch without sharing...")
-        app_interface.launch()
+        app_interface.launch(server_name="0.0.0.0")
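The new loading path looks for the checkpoint under MODELS_DIR before falling back to the Hub. Below is a hedged sketch of pre-populating that directory with huggingface_hub (already listed in requirements.txt) so the local branch is taken at startup; it assumes network access when the script runs:

```python
# Pre-download the checkpoint into MODELS_DIR so that app.py's
# os.path.exists(local_model_path) check succeeds at startup.
import os
from huggingface_hub import snapshot_download

models_dir = os.environ.get("MODELS_DIR", "models")
repo_id = "ICTNLP/LLaMA-Omni2-0.5B"

# app.py resolves MODELS_DIR/<basename of repo id>, i.e. models/LLaMA-Omni2-0.5B
target = os.path.join(models_dir, os.path.basename(repo_id))
snapshot_download(repo_id=repo_id, local_dir=target)
print(f"Model files available under {target}")
```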
app.yaml
ADDED
@@ -0,0 +1,16 @@
+sdk: docker
+build_config:
+  gpu: true
+  cuda: "11.8"
+  python_version: "3.10"
+  system_packages:
+    - "ffmpeg"
+    - "libsndfile1"
+resources:
+  gpu: 1
+  cpu: 2
+  memory: "16G"
+  disk: "10G"
+models:
+  - "openai/whisper-tiny"
+  - "ICTNLP/LLaMA-Omni2-0.5B"
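app.yaml requests one GPU on CUDA 11.8 with 16 GB of memory. A small sanity-check sketch, runnable inside the container, for confirming the runtime matches what the config asks for; the PyTorch calls are standard, while the build_config/resources keys themselves are interpreted by the Space platform:

```python
# Quick runtime check against the resources requested in app.yaml.
import torch

if torch.cuda.is_available():
    print(f"GPUs visible: {torch.cuda.device_count()}")   # resources -> gpu: 1
    print(f"CUDA build: {torch.version.cuda}")            # build_config -> cuda: "11.8"
    print(f"Device 0: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU visible; inference will fall back to CPU.")
```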
requirements.txt
CHANGED
@@ -1,23 +1,25 @@
-torch>=2.
+torch>=2.0.0
 torchaudio>=2.1.0
 torchvision>=0.16.0
 # packaging # often a dep of others
 # ninja # often a dep of others
 uvicorn
-gradio>=
+gradio>=5.29.0
 einops
-transformers>=4.
-accelerate
+transformers>=4.28.1,<5.0.0
+accelerate>=0.33.0
 bitsandbytes # If LLaMA-Omni2 makes use of it for 4/8bit loading
-sentencepiece
+sentencepiece>=0.1.99
 protobuf
-openai-whisper
+openai-whisper>=20230918
 shortuuid
 pydub
 ffmpeg-python
 huggingface_hub # For downloading models from HF Hub
-soundfile
-safetensors
+soundfile>=0.13.0
+safetensors>=0.3.1
 ai2-olmo # In case LLaMA-Omni2 uses olmo under the hood for the LLM part
 
-# fairseq and flash-attn are removed, expected to be handled by LLaMA-Omni2's setup via `pip install -e .` in Dockerfile
+# fairseq and flash-attn are removed, expected to be handled by LLaMA-Omni2's setup via `pip install -e .` in Dockerfile
+
+numpy>=1.21.0