marcosremar2 committed on
Commit
d9bc8de
·
1 Parent(s): 18dcc68
Files changed (3)
  1. README.md +49 -2
  2. pre-requirements.txt +3 -0
  3. requirements_spaces.txt +1 -1
README.md CHANGED
@@ -9,9 +9,56 @@ app_file: app_gradio_spaces.py
 pinned: false
 ---
 
-# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction with Large Language Models
-
-This is a Gradio deployment of [LLaMA-Omni](https://github.com/ictnlp/LLaMA-Omni), a speech-language model built upon Llama-3.1-8B-Instruct. It supports low-latency and high-quality speech interactions, simultaneously generating both text and speech responses based on speech instructions.
+# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction
+
+This is a Hugging Face Spaces deployment of LLaMA-Omni, a speech-language model that can process both speech and text inputs and generate both text and speech responses.
+
+## 🚀 Features
+
+- **Speech-to-Text**: Record your voice or upload audio to interact with the model
+- **Text Input**: Type messages directly for text-based conversation
+- **Text-to-Speech**: Hear the model's responses in natural-sounding speech
+- **Seamless Experience**: Switch between voice and text interaction modes
+
+## 🛠️ Technology Stack
+
+- **Base Model**: Llama-3.1-8B-Instruct fine-tuned for speech interaction
+- **Speech Recognition**: OpenAI Whisper large-v3 for accurate transcription
+- **Text-to-Speech**: Custom vocoder for natural speech generation
+
+## 📊 Usage
+
+1. Click the "Setup Environment" button to initialize the model
+2. Wait for setup to complete (downloading models may take a few minutes)
+3. Click "Start LLaMA-Omni Services" to start the model
+4. Choose either:
+   - **Speech Input**: Record or upload audio to speak to the model
+   - **Text Input**: Type your message directly
+5. Press "Submit" to get a response
+
+## 🧠 Technical Details
+
+This model combines large language model capabilities with speech processing to create a natural multimodal interaction experience. The architecture integrates:
+
+- Speech recognition using Whisper
+- Text generation with a fine-tuned Llama 3.1 8B model
+- Speech synthesis with a high-quality vocoder
+
+## 💡 Tips
+
+- Speak clearly for best speech recognition results
+- Short, clear questions tend to work best
+- Give the model a moment to process complex inputs
+
+## 🔄 Limitations
+
+- Processing speech may take a few seconds depending on server load
+- The model works best with English language inputs
+- Complex or very long conversations may occasionally lead to less coherent responses
+
+---
+
+Developed based on [LLaMA-Omni](https://github.com/ICTNLP/LLaMA-Omni) by ICTNLP.
 
 ## 💡 Highlights
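The "Technical Details" section of the new README describes a three-stage pipeline (Whisper transcription, Llama 3.1 generation, vocoder synthesis). A minimal sketch of that control flow, using stand-in stub functions rather than the project's actual APIs (all names here are illustrative, not from the LLaMA-Omni codebase):

```python
# Sketch of the three-stage speech-interaction loop the README describes.
# Each stage function is a stand-in stub, not the project's real API.

def transcribe(audio: bytes) -> str:
    """Stand-in for Whisper large-v3 speech recognition."""
    return "what is the weather today"

def generate(prompt: str) -> str:
    """Stand-in for the fine-tuned Llama-3.1-8B-Instruct model."""
    return f"Response to: {prompt}"

def synthesize(text: str) -> bytes:
    """Stand-in for the vocoder that renders speech audio."""
    return text.encode("utf-8")

def speech_turn(audio: bytes) -> tuple[str, bytes]:
    """One voice interaction: speech in, (text, speech) out,
    matching the README's claim of simultaneous text + speech output."""
    text_in = transcribe(audio)
    text_out = generate(text_in)
    return text_out, synthesize(text_out)
```
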
pre-requirements.txt ADDED
@@ -0,0 +1,3 @@
+pip<24.1
+omegaconf @ git+https://github.com/omry/omegaconf.git@fd9109cff74d05794e14562dcbdde442eb24635d
+hydra-core==1.0.7
requirements_spaces.txt CHANGED
@@ -8,7 +8,7 @@ torch>=2.0.0
 numpy>=1.24.0
 transformers>=4.34.0
 accelerate>=0.21.0
-gradio>=3.50.2
+gradio>=3.50.2,<4.0.0  # Stay below 4.0 to maintain compatibility
 fastapi>=0.104.0
 uvicorn>=0.23.2
 pydantic>=2.3.0
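The tightened pin keeps gradio inside the 3.x API line. A quick sketch of which versions the new `>=3.50.2,<4.0.0` specifier accepts (a naive numeric comparison for illustration, not a full PEP 440 parser; the function name is made up):

```python
def in_gradio_pin(version: str) -> bool:
    """Check a purely numeric version string against gradio>=3.50.2,<4.0.0.

    Naive dotted-tuple comparison for illustration only; real resolvers
    implement the full PEP 440 rules (pre-releases, epochs, etc.).
    """
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse("3.50.2") <= parse(version) < parse("4.0.0")

print(in_gradio_pin("3.50.2"))  # True: the lower bound is inclusive
print(in_gradio_pin("4.0.0"))   # False: 4.x is now excluded
```
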