Spaces build status: Build error

Commit d9bc8de · Parents: 18dcc68, fdfddfd

Files changed:
- README.md: +49 -2
- pre-requirements.txt: +3 -0
- requirements_spaces.txt: +1 -1
README.md (CHANGED)

@@ -9,9 +9,56 @@ app_file: app_gradio_spaces.py
 pinned: false
 ---
 
-# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction
+# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction
 
-This is a
+This is a Hugging Face Spaces deployment of LLaMA-Omni, a speech-language model that can process both speech and text inputs and generate both text and speech responses.
+
+## Features
+
+- **Speech-to-Text**: Record your voice or upload audio to interact with the model
+- **Text Input**: Type messages directly for text-based conversation
+- **Text-to-Speech**: Hear the model's responses in natural-sounding speech
+- **Seamless Experience**: Switch between voice and text interaction modes
+
+## 🛠️ Technology Stack
+
+- **Base Model**: Llama-3.1-8B-Instruct fine-tuned for speech interaction
+- **Speech Recognition**: OpenAI Whisper large-v3 for accurate transcription
+- **Text-to-Speech**: Custom vocoder for natural speech generation
+
+## Usage
+
+1. Click the "Setup Environment" button to initialize the model
+2. Wait for setup to complete (downloading models may take a few minutes)
+3. Click "Start LLaMA-Omni Services" to start the model
+4. Choose either:
+   - **Speech Input**: Record or upload audio to speak to the model
+   - **Text Input**: Type your message directly
+5. Press "Submit" to get a response
+
+## 🔧 Technical Details
+
+This model combines large language model capabilities with speech processing to create a natural multimodal interaction experience. The architecture integrates:
+
+- Speech recognition using Whisper
+- Text generation with a fine-tuned Llama 3.1 8B model
+- Speech synthesis with a high-quality vocoder
+
+## 💡 Tips
+
+- Speak clearly for best speech recognition results
+- Short, clear questions tend to work best
+- Give the model a moment to process complex inputs
+
+## Limitations
+
+- Processing speech may take a few seconds depending on server load
+- The model works best with English language inputs
+- Complex or very long conversations may occasionally lead to less coherent responses
+
+---
+
+Developed based on [LLaMA-Omni](https://github.com/ICTNLP/LLaMA-Omni) by ICTNLP.
 
 ## 💡 Highlights
 
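The Usage section added above describes a two-button flow (environment setup, then service start) in front of paired speech/text inputs. As a rough illustration only, here is a minimal Gradio 3.x sketch of that flow; every function name and body below is a hypothetical placeholder, since the actual `app_gradio_spaces.py` is not part of this diff.

```python
# Minimal sketch of the UI flow the README describes, against the Gradio 3.x
# API (see the gradio<4.0.0 pin below). Placeholder logic only.
import gradio as gr

def setup_environment():
    # Placeholder: the real app would download model weights here.
    return "Setup complete"

def start_services():
    # Placeholder: the real app would launch the model and vocoder services here.
    return "Services running"

def respond(audio_path, text):
    # Placeholder: transcribe audio (if any), generate a reply, synthesize speech.
    return f"Echo: {text or audio_path}", None

with gr.Blocks() as demo:
    status = gr.Textbox(label="Status", interactive=False)
    gr.Button("Setup Environment").click(setup_environment, outputs=status)
    gr.Button("Start LLaMA-Omni Services").click(start_services, outputs=status)

    # Gradio 3.x uses the singular `source` parameter for microphone input.
    audio_in = gr.Audio(source="microphone", type="filepath", label="Speech Input")
    text_in = gr.Textbox(label="Text Input")
    text_out = gr.Textbox(label="Response")
    audio_out = gr.Audio(label="Spoken Response")
    gr.Button("Submit").click(respond, inputs=[audio_in, text_in],
                              outputs=[text_out, audio_out])

demo.launch()
```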
pre-requirements.txt (ADDED)

@@ -0,0 +1,3 @@
+pip<24.1
+omegaconf @ git+https://github.com/omry/omegaconf.git@fd9109cff74d05794e14562dcbdde442eb24635d
+hydra-core==1.0.7
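On Hugging Face Spaces, pre-requirements.txt is installed before requirements.txt, so these pins take effect before the main dependency resolution. The `pip<24.1` cap plus a pinned omegaconf commit is a common workaround for older omegaconf/hydra releases whose package metadata newer pip versions reject; reading that as the motivation here is an assumption, as the commit carries no message. A quick hypothetical post-install sanity check:

```python
# Hypothetical sanity check that the pinned config stack resolved together:
# hydra-core 1.0.7 requires an omegaconf from the 2.0.x line.
import hydra
import omegaconf

print("omegaconf:", omegaconf.__version__)  # expect a 2.0.x version
print("hydra:", hydra.__version__)          # expect 1.0.7
```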
requirements_spaces.txt (CHANGED)

@@ -8,7 +8,7 @@ torch>=2.0.0
 numpy>=1.24.0
 transformers>=4.34.0
 accelerate>=0.21.0
-gradio>=3.50.2
+gradio>=3.50.2,<4.0.0  # Stay below 4.0 to maintain compatibility
 fastapi>=0.104.0
 uvicorn>=0.23.2
 pydantic>=2.3.0
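The only change here tightens the gradio range. A plausible reason (an assumption; the inline comment says only "compatibility") is that Gradio 4.0 renamed several component parameters, so 3.x-era code breaks on upgrade. For example:

```python
import gradio as gr

# Gradio 3.x (what this pin keeps): singular `source` parameter.
mic = gr.Audio(source="microphone", type="filepath")

# Gradio 4.x renamed it to a plural list; the 3.x form raises a TypeError there:
# mic = gr.Audio(sources=["microphone"], type="filepath")
```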