---
title: LLaMA-Omni
emoji: 🦙🎧
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
---

# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction

This is a Hugging Face Spaces deployment of LLaMA-Omni, a speech-language model that processes both speech and text inputs and generates both text and speech responses.

## 🚀 Features

- **Speech-to-Text**: Record your voice or upload audio to interact with the model
- **Text Input**: Type messages directly for text-based conversation
- **Text-to-Speech**: Hear the model's responses in natural-sounding speech
- **Seamless Experience**: Switch between voice and text interaction modes

## 🛠️ Technology Stack

- **Base Model**: Llama-3.1-8B-Instruct fine-tuned for speech interaction
- **Speech Recognition**: OpenAI Whisper large-v3 for accurate transcription
- **Text-to-Speech**: Custom vocoder for natural speech generation

## 📊 Usage

1. Click the "Setup Environment" button to initialize the model
2. Wait for setup to complete (downloading models may take a few minutes)
3. Click "Start LLaMA-Omni Services" to start the model
4. Choose either:
   - **Speech Input**: Record or upload audio to speak to the model
   - **Text Input**: Type your message directly
5. Press "Submit" to get a response

## 🧠 Technical Details

This model combines large language model capabilities with speech processing to create a natural multimodal interaction experience. The architecture integrates:

- Speech recognition using Whisper
- Text generation with a fine-tuned Llama 3.1 8B model
- Speech synthesis with a high-quality vocoder

## 💡 Tips

- Speak clearly for the best speech recognition results
- Short, clear questions tend to work best
- Give the model a moment to process complex inputs

## 🔄 Limitations

- Processing speech may take a few seconds depending on server load
- The model works best with English-language inputs
- Very long or complex conversations may occasionally produce less coherent responses

---

Developed based on [LLaMA-Omni](https://github.com/ICTNLP/LLaMA-Omni) by ICTNLP.

## 💡 Highlights

* 💪 **Built on Llama-3.1-8B-Instruct, ensuring high-quality responses.**
* 🚀 **Low-latency speech interaction, with latency as low as 226 ms.**
* 🎧 **Simultaneous generation of both text and speech responses.**

## 📋 Prerequisites

- Python 3.10+
- PyTorch 2.0+
- CUDA-compatible GPU (for optimal performance)

## 🛠️ Setup

1. Clone this repository:

   ```bash
   git clone https://github.com/your-username/llama-omni.git
   cd llama-omni
   ```

2. Create a virtual environment and install dependencies:

   ```bash
   conda create -n llama-omni python=3.10
   conda activate llama-omni
   pip install -e .
   ```

3. Install fairseq:

   ```bash
   pip install git+https://github.com/pytorch/fairseq.git
   ```

4. Install optional dependencies (skip on Macs with Apple Silicon):

   ```bash
   # Only run this if you are not on a Mac with Apple Silicon
   pip install flash-attn
   ```

## 🐳 Docker Deployment

We provide Docker support for easy deployment without worrying about dependencies:

1. Make sure Docker and Docker Compose are installed on your system
2. Build and run the container:

   ```bash
   # Using the provided shell script
   ./run_docker.sh

   # Or manually with docker-compose
   docker-compose up --build
   ```

3. Access the application at http://localhost:7860

The Docker container will automatically:

- Install all required dependencies
- Download the necessary model files
- Start the application

### GPU Support

The Docker setup includes NVIDIA GPU support. Make sure you have:

- NVIDIA drivers installed on your host
- NVIDIA Container Toolkit installed (for GPU passthrough)

## 🚀 Gradio Spaces Deployment

To deploy on Gradio Spaces:

1. Create a new Gradio Space
2. Connect this GitHub repository
3. Set the environment requirements (Python 3.10)
4. Deploy!

The app will automatically:

- Download the required models (Whisper, LLaMA-Omni, vocoder)
- Start the controller
- Start the model worker
- Launch the web interface

## 🖥️ Local Usage

If you want to run the application locally without Docker:

```bash
python app.py
```

This will:

1. Start the controller
2. Start a model worker that loads LLaMA-Omni
3. Launch a web interface

You can then access the interface at http://localhost:8000.
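For a concrete picture of what `python app.py` orchestrates, here is a minimal sketch of that three-step launch sequence. The module paths, ports, and flags are assumptions modeled on the upstream LLaMA-Omni repository and the list above, not verified commands; check `app.py` for the exact startup logic.

```python
# launch_sketch.py -- rough sketch of the controller/worker/web startup.
# Module paths, ports, and flags are ASSUMPTIONS based on the upstream
# LLaMA-Omni repo; consult app.py for the authoritative commands.
import subprocess
import sys
import time

CONTROLLER_PORT = 10000  # assumed controller port
WORKER_PORT = 40000      # assumed model worker port
WEB_PORT = 8000          # web interface port from this README

processes = []

def launch(module_args):
    """Start a python -m subprocess and remember it for cleanup."""
    proc = subprocess.Popen([sys.executable, "-m", *module_args])
    processes.append(proc)
    return proc

try:
    # 1. Controller: tracks registered model workers.
    launch(["omni_speech.serve.controller",
            "--host", "0.0.0.0", "--port", str(CONTROLLER_PORT)])
    time.sleep(5)  # crude wait; a real launcher would poll for readiness

    # 2. Model worker: loads LLaMA-Omni and registers with the controller.
    launch(["omni_speech.serve.model_worker",
            "--host", "0.0.0.0", "--port", str(WORKER_PORT),
            "--controller", f"http://localhost:{CONTROLLER_PORT}",
            "--worker", f"http://localhost:{WORKER_PORT}",
            "--model-path", "models/Llama-3.1-8B-Omni"])  # hypothetical path
    time.sleep(5)

    # 3. Web interface: Gradio front end that talks to the controller.
    launch(["omni_speech.serve.gradio_web_server",
            "--controller", f"http://localhost:{CONTROLLER_PORT}",
            "--port", str(WEB_PORT)])

    # Keep the launcher alive until the children exit.
    for proc in processes:
        proc.wait()
except KeyboardInterrupt:
    for proc in processes:
        proc.terminate()
```

The ordering matters: the controller must come up first so the model worker can register with it, and the web interface then discovers available models through the controller.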
## 📝 Example Usage

### Speech-to-Speech

1. Select the "Speech Input" tab
2. Record or upload audio
3. Click "Submit"
4. Receive both text and speech responses

### Text-to-Speech

1. Select the "Text Input" tab
2. Type your message
3. Click "Submit"
4. Receive both text and speech responses

## 📚 Development

To contribute to this project:

1. Fork the repository
2. Make your changes
3. Submit a pull request

## 📄 License

This code is released under the Apache-2.0 License. The model is intended for academic research purposes only and may **NOT** be used for commercial purposes.

Original work by Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng.
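Beyond the browser UI, a running Gradio app can usually also be driven from code. Below is a hypothetical client sketch using the `gradio_client` package against the local web interface; the endpoint name and argument order are assumptions, so call `client.view_api()` first to see what this app actually exposes.

```python
# client_sketch.py -- hypothetical programmatic client for the local web
# interface. The api_name and arguments are ASSUMPTIONS; run
# client.view_api() to discover the endpoints this app really exposes.
from gradio_client import Client, handle_file

client = Client("http://localhost:8000/")

# Inspect the real API surface before relying on any endpoint below.
client.view_api()

# Hypothetical speech-to-speech call: send a WAV file, get text + audio back.
result = client.predict(
    handle_file("question.wav"),  # local audio file to send
    api_name="/submit",           # assumed endpoint name
)
print(result)
```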