---
title: LLaMA-Omni
emoji: 🦙🎧
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 3.50.2
app_file: app_gradio_spaces.py
pinned: false
---

# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction with Large Language Models

This is a Gradio deployment of [LLaMA-Omni](https://github.com/ictnlp/LLaMA-Omni), a speech-language model built upon Llama-3.1-8B-Instruct. It supports low-latency and high-quality speech interactions, simultaneously generating both text and speech responses based on speech instructions.

## 💡 Highlights

* 💪 **Built on Llama-3.1-8B-Instruct, ensuring high-quality responses.**
* 🚀 **Low-latency speech interaction, with latency as low as 226 ms.**
* 🎧 **Simultaneous generation of both text and speech responses.**

## 📋 Prerequisites

- Python 3.10+
- PyTorch 2.0+
- CUDA-compatible GPU (for optimal performance)
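
A quick way to confirm these prerequisites before installing anything is a small check script. The sketch below is illustrative and not part of this repository: the helper names `parse_version` and `check_environment` are made up for the example, and PyTorch and CUDA are simply reported as missing when `torch` is not installed yet.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # from the prerequisites list
MIN_TORCH = (2, 0)    # from the prerequisites list


def parse_version(text):
    """Turn a version string such as '2.1.0+cu118' into (major, minor)."""
    core = text.split("+")[0]  # drop local build suffixes like '+cu118'
    parts = []
    for piece in core.split(".")[:2]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)


def check_environment():
    """Return a dict saying whether each prerequisite appears to be met."""
    report = {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "torch_ok": False,
        "cuda_ok": False,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed
    try:
        import torch
    except (ImportError, OSError):
        return report  # PyTorch is present but failed to load
    report["torch_ok"] = parse_version(torch.__version__) >= MIN_TORCH
    report["cuda_ok"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

A `cuda_ok: no` result does not necessarily block a CPU run; the GPU is listed above only as the recommendation for optimal performance.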

## πŸ› οΈ Setup

1. Clone this repository:
   ```bash
   git clone https://github.com/your-username/llama-omni.git
   cd llama-omni
   ```

2. Create a virtual environment and install dependencies:
   ```bash
   conda create -n llama-omni python=3.10
   conda activate llama-omni
   pip install -e .
   ```

3. Install fairseq:
   ```bash
   pip install git+https://github.com/pytorch/fairseq.git
   ```

4. Install optional dependencies (skip this step on Apple Silicon Macs, i.e. M1/M2):
   ```bash
   # Only run this if not on Mac with Apple Silicon
   pip install flash-attn
   ```
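
If the setup is scripted, the Apple Silicon exception in step 4 can be detected rather than remembered. This is a minimal sketch using only the standard library; `is_apple_silicon` is a hypothetical helper, not something this repository provides.

```python
import platform


def is_apple_silicon(system=None, machine=None):
    """Return True on macOS running natively on an ARM (M-series) CPU."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"


if __name__ == "__main__":
    if is_apple_silicon():
        print("Apple Silicon detected: skip 'pip install flash-attn'.")
    else:
        print("Optional: run 'pip install flash-attn'.")
```

Note that a Python interpreter running under Rosetta reports `x86_64`, so this check can miss an Apple Silicon machine in that case.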

## 🐳 Docker Deployment

Docker support is provided so that you can deploy without managing dependencies yourself:

1. Make sure Docker and Docker Compose are installed on your system

2. Build and run the container:
   ```bash
   # Using the provided shell script
   ./run_docker.sh
   
   # Or manually with docker-compose
   docker-compose up --build
   ```

3. Access the application at http://localhost:7860

The Docker container will automatically:
- Install all required dependencies
- Download the necessary model files
- Start the application
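
Because the container installs dependencies and downloads model files on first start, the web page may not respond right away. One way to script around that is to poll the URL until it answers. The sketch below assumes the address given in step 3; the function name `wait_for_app` and the timeout values are arbitrary choices for illustration.

```python
import http.client
import time
import urllib.request


def wait_for_app(url="http://localhost:7860", timeout_s=600.0, interval_s=5.0):
    """Poll `url` until it returns HTTP 200 or `timeout_s` seconds pass."""
    # The target is local, so ignore any HTTP proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener.open(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            pass  # nothing listening yet, or the app is still starting
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_app()` returns `True` once the page is being served and `False` if the timeout expires first. Pass a different `url` if the application listens elsewhere.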

### GPU Support

The Docker setup includes NVIDIA GPU support. Make sure you have:
- NVIDIA drivers installed on your host
- NVIDIA Container Toolkit installed (for GPU passthrough)
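
Both requirements can be checked on the host before building the image. The following is a rough sketch, not a definitive test: it treats the presence of the `nvidia-smi` and `nvidia-ctk` commands on `PATH` as a proxy for the driver and the Container Toolkit being installed, and `host_gpu_report` is a name invented for this example.

```python
import shutil
import subprocess


def host_gpu_report():
    """Summarize what NVIDIA tooling is visible on the host."""
    report = {
        "nvidia_smi": shutil.which("nvidia-smi") is not None,  # ships with the driver
        "nvidia_ctk": shutil.which("nvidia-ctk") is not None,  # Container Toolkit CLI
        "gpus": [],
    }
    if report["nvidia_smi"]:
        try:
            # 'nvidia-smi -L' prints one line per detected GPU.
            result = subprocess.run(
                ["nvidia-smi", "-L"],
                capture_output=True, text=True, check=False, timeout=30,
            )
        except (OSError, subprocess.TimeoutExpired):
            return report
        if result.returncode == 0:
            report["gpus"] = [
                line.strip() for line in result.stdout.splitlines() if line.strip()
            ]
    return report


if __name__ == "__main__":
    for key, value in host_gpu_report().items():
        print(f"{key}: {value}")
```

An empty `gpus` list with `nvidia_smi` set to `True` usually points at a driver problem rather than a Docker problem.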

## 🚀 Gradio Spaces Deployment

To deploy as a Gradio Space on Hugging Face Spaces:

1. Create a new Space and select the Gradio SDK
2. Connect this GitHub repository
3. Set the environment requirements (Python 3.10)
4. Deploy!

The app will automatically:
- Download the required models (Whisper, LLaMA-Omni, vocoder)
- Start the controller
- Start the model worker
- Launch the web interface

## 🖥️ Local Usage

To run the application locally without Docker:

```bash
python app.py
```

This will:
1. Start the controller
2. Start a model worker that loads LLaMA-Omni
3. Launch a web interface

You can then access the interface at: http://localhost:8000
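
The startup order described above (controller, then model worker, then web interface) is a common pattern for multi-process serving. The sketch below shows one way such a launcher could be written; it is not the code in `app.py`. The function names are invented for the example, and the commands in the demo are stand-ins that only sleep, so replace them with the real service commands.

```python
import subprocess
import sys
import time


def stop_all(running):
    """Terminate started services in reverse start order."""
    for _name, proc in reversed(running):
        if proc.poll() is None:
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()


def start_in_order(steps, settle_s=1.0):
    """Start each (name, argv) pair in order; stop everything if one exits early."""
    running = []
    try:
        for name, argv in steps:
            proc = subprocess.Popen(argv)
            running.append((name, proc))
            time.sleep(settle_s)  # crude pause so the next service can find this one
            if proc.poll() is not None:
                raise RuntimeError(f"{name} exited early with code {proc.returncode}")
    except Exception:
        stop_all(running)
        raise
    return running


if __name__ == "__main__":
    # Stand-in command: a process that just sleeps (an assumption for the demo).
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    services = start_in_order(
        [("controller", sleeper), ("model worker", sleeper), ("web interface", sleeper)],
        settle_s=0.2,
    )
    print("started:", [name for name, _ in services])
    stop_all(services)
```

A fixed pause is the simplest way to sequence the services; polling each service's port before starting the next would be more robust.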

## πŸ“ Example Usage

### Speech-to-Speech

1. Select the "Speech Input" tab
2. Record or upload audio
3. Click "Submit"
4. Receive both text and speech responses

### Text-to-Speech

1. Select the "Text Input" tab
2. Type your message
3. Click "Submit"
4. Receive both text and speech responses

## 📚 Development

To contribute to this project:

1. Fork the repository
2. Make your changes
3. Submit a pull request

## 📄 LICENSE

This code is released under the Apache-2.0 License. The model is intended for academic research purposes only and may **NOT** be used for commercial purposes.

Original work by Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng.