---
title: LLaMA-Omni
emoji: 🦙🎧
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
---
# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction
This is a Hugging Face Spaces deployment of LLaMA-Omni, a speech-language model that can process both speech and text inputs and generate both text and speech responses.
## Features
- **Speech-to-Text**: Record your voice or upload audio to interact with the model
- **Text Input**: Type messages directly for text-based conversation
- **Text-to-Speech**: Hear the model's responses in natural-sounding speech
- **Seamless Experience**: Switch between voice and text interaction modes
## 🛠️ Technology Stack
- **Base Model**: Llama-3.1-8B-Instruct fine-tuned for speech interaction
- **Speech Recognition**: OpenAI Whisper large-v3 for accurate transcription
- **Text-to-Speech**: Custom vocoder for natural speech generation
## Usage
1. Click the "Setup Environment" button to initialize the model
2. Wait for setup to complete (downloading models may take a few minutes)
3. Click "Start LLaMA-Omni Services" to start the model
4. Choose either:
   - **Speech Input**: Record or upload audio to speak to the model
   - **Text Input**: Type your message directly
5. Press "Submit" to get a response
## 🔧 Technical Details
This model combines large language model capabilities with speech processing to create a natural multimodal interaction experience. The architecture integrates:
- Speech recognition using Whisper
- Text generation with a fine-tuned Llama 3.1 8B model
- Speech synthesis with a high-quality vocoder
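The data flow through these three stages can be sketched as a simple composition. This is an illustration only, not code from this repository: `run_pipeline`, `Response`, and the three stub functions are hypothetical names, and the stubs stand in for Whisper, the fine-tuned Llama model, and the vocoder so the example runs without any models. It also runs the stages one after another, which simplifies the simultaneous text and speech generation the model is described as providing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Response:
    text: str     # text reply from the language model
    audio: bytes  # synthesized speech for that reply


def run_pipeline(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Response:
    """Chain the stages: speech -> text -> reply text -> speech."""
    prompt = transcribe(audio_in)
    reply = generate_reply(prompt)
    return Response(text=reply, audio=synthesize(reply))


# Hypothetical stubs; real code would call the actual models here.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


result = run_pipeline(b"hello", fake_transcribe, fake_generate, fake_synthesize)
print(result.text)  # You said: hello
```

Passing the stages in as callables keeps the sketch independent of any particular model-loading code.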
## 💡 Tips
- Speak clearly for best speech recognition results
- Short, clear questions tend to work best
- Give the model a moment to process complex inputs
## Limitations
- Processing speech may take a few seconds depending on server load
- The model works best with English language inputs
- Complex or very long conversations may occasionally lead to less coherent responses
---
Based on [LLaMA-Omni](https://github.com/ICTNLP/LLaMA-Omni) by ICTNLP.
## 💡 Highlights
* 💪 **Built on Llama-3.1-8B-Instruct, ensuring high-quality responses.**
* 🚀 **Low-latency speech interaction with a latency as low as 226 ms.**
* 🎧 **Simultaneous generation of both text and speech responses.**
## Prerequisites
- Python 3.10+
- PyTorch 2.0+
- CUDA-compatible GPU (for optimal performance)
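Before installing, it can help to confirm these prerequisites from Python. The helper below is a minimal sketch and not part of this repository; `check_prerequisites` is a hypothetical name. It uses only the standard library plus `torch.__version__` and `torch.cuda.is_available()` when PyTorch is present, and it reports the PyTorch version rather than enforcing the 2.0+ requirement.

```python
import importlib.util
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a dict describing whether each prerequisite appears to be met."""
    report = {
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        "torch_installed": False,
        "torch_version": None,
        "cuda_available": False,
    }

    # Look for PyTorch first so a missing install does not raise ImportError.
    if importlib.util.find_spec("torch") is not None:
        try:
            import torch

            report["torch_installed"] = True
            report["torch_version"] = torch.__version__
            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken install (for example, missing shared libraries) counts as absent.
            pass
    return report


for name, value in check_prerequisites().items():
    print(f"{name}: {value}")
```

If `cuda_available` is `False`, the application may still run on CPU, but the list above suggests a GPU is needed for acceptable performance.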
## 🛠️ Setup
1. Clone this repository:
```bash
git clone https://github.com/your-username/llama-omni.git
cd llama-omni
```
2. Create a virtual environment and install dependencies:
```bash
conda create -n llama-omni python=3.10
conda activate llama-omni
pip install -e .
```
3. Install fairseq:
```bash
pip install git+https://github.com/pytorch/fairseq.git
```
4. Install optional dependencies (skip this step on Macs with Apple Silicon):
```bash
# Only run this if not on Mac with Apple Silicon
pip install flash-attn
```
## 🐳 Docker Deployment
We provide Docker support for easy deployment without worrying about dependencies:
1. Make sure Docker and Docker Compose are installed on your system
2. Build and run the container:
```bash
# Using the provided shell script
./run_docker.sh
# Or manually with docker-compose
docker-compose up --build
```
3. Access the application at http://localhost:7860
The Docker container will automatically:
- Install all required dependencies
- Download the necessary model files
- Start the application
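Because the container downloads model files before starting the application, the port may not accept connections right away. The following standard-library sketch polls a TCP port until it opens; `wait_for_port` is a hypothetical helper, not something shipped with this repository, and the timeout values are arbitrary assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 7860, timeout=300)` returns `True` once something is listening on the port mentioned above. An open port only shows that the server accepts connections, not that model loading has finished.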
### GPU Support
The Docker setup includes NVIDIA GPU support. Make sure you have:
- NVIDIA drivers installed on your host
- NVIDIA Container Toolkit installed (for GPU passthrough)
## Hugging Face Spaces Deployment
To deploy on Hugging Face Spaces:
1. Create a new Space
2. Connect this GitHub repository
3. Set the environment requirements (Python 3.10)
4. Deploy!
The app will automatically:
- Download the required models (Whisper, LLaMA-Omni, vocoder)
- Start the controller
- Start the model worker
- Launch the web interface
## 🖥️ Local Usage
If you want to run the application locally without Docker:
```bash
python app.py
```
This will:
1. Start the controller
2. Start a model worker that loads LLaMA-Omni
3. Launch a web interface
You can then access the interface at: http://localhost:8000
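The startup order described above (controller, then model worker, then web interface) can be sketched with `subprocess`. This is not the contents of `app.py`: the function names are hypothetical and each command is a placeholder that only sleeps, so the example runs without the models. The real commands and their arguments are defined by the application itself.

```python
import subprocess
import sys
from typing import Dict, List


def start_services(commands: Dict[str, List[str]]) -> Dict[str, subprocess.Popen]:
    """Start each service in insertion order and return the running processes."""
    processes = {}
    for name, cmd in commands.items():
        print(f"Starting {name}...")
        processes[name] = subprocess.Popen(cmd)
    return processes


def stop_services(processes: Dict[str, subprocess.Popen]) -> None:
    """Terminate services in reverse start order and wait for each to exit."""
    for name in reversed(list(processes)):
        proc = processes[name]
        proc.terminate()
        proc.wait(timeout=10)


# Placeholder command (assumption): stands in for each real service process.
placeholder = [sys.executable, "-c", "import time; time.sleep(30)"]
commands = {
    "controller": placeholder,
    "model_worker": placeholder,
    "web_interface": placeholder,
}
```

Calling `start_services(commands)` launches the three placeholders in order, and `stop_services` shuts them down in reverse. A real launcher would likely also wait for each service to become ready before starting the next one.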
## Example Usage
### Speech-to-Speech
1. Select the "Speech Input" tab
2. Record or upload audio
3. Click "Submit"
4. Receive both text and speech responses
### Text-to-Speech
1. Select the "Text Input" tab
2. Type your message
3. Click "Submit"
4. Receive both text and speech responses
## Development
To contribute to this project:
1. Fork the repository
2. Make your changes
3. Submit a pull request
## LICENSE
This code is released under the Apache-2.0 License. The model is intended for academic research purposes only and may **NOT** be used for commercial purposes.
Original work by Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng.