metadata

title: LLaMA-Omni
emoji: 🦙🎧
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false

🦙🎧 LLaMA-Omni: Seamless Speech Interaction

This is a Hugging Face Spaces deployment of LLaMA-Omni, a speech-language model that accepts either speech or text input and responds with both text and speech.

🚀 Features

Speech-to-Text: Record your voice or upload audio to interact with the model
Text Input: Type messages directly for text-based conversation
Text-to-Speech: Hear the model's responses in natural-sounding speech
Seamless Experience: Switch between voice and text interaction modes

🛠️ Technology Stack

Base Model: Llama-3.1-8B-Instruct fine-tuned for speech interaction
Speech Recognition: OpenAI Whisper large-v3 for accurate transcription
Text-to-Speech: Custom vocoder for natural speech generation

📊 Usage

Click the "Setup Environment" button to initialize the model
Wait for setup to complete (downloading models may take a few minutes)
Click "Start LLaMA-Omni Services" to start the model
Choose either:
- Speech Input: Record or upload audio to speak to the model
- Text Input: Type your message directly
Press "Submit" to get a response

🧠 Technical Details

This model combines large language model capabilities with speech processing to create a natural multimodal interaction experience. The architecture integrates:

Speech recognition using Whisper
Text generation with a fine-tuned Llama 3.1 8B model
Speech synthesis with a high-quality vocoder
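
The three stages listed above can be pictured as a simple pipeline. The sketch below is illustrative only: `respond`, `OmniResponse`, and the three callables are hypothetical names, not part of the LLaMA-Omni code. It also runs the stages one after another for clarity, whereas the model itself generates text and speech simultaneously.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class OmniResponse:
    transcript: str  # the text the language model actually received
    text: str        # the generated text reply
    audio: bytes     # the synthesized speech for that reply


def respond(
    transcribe: Callable[[bytes], str],   # stand-in for speech recognition (Whisper)
    generate: Callable[[str], str],       # stand-in for the fine-tuned Llama model
    synthesize: Callable[[str], bytes],   # stand-in for the vocoder
    audio: Optional[bytes] = None,
    text: Optional[str] = None,
) -> OmniResponse:
    """Route one user turn (speech OR text) through the three stages."""
    if (audio is None) == (text is None):
        raise ValueError("provide exactly one of audio or text")
    # Speech input is transcribed first; text input skips that stage.
    prompt = transcribe(audio) if audio is not None else text
    reply = generate(prompt)
    return OmniResponse(transcript=prompt, text=reply, audio=synthesize(reply))
```

Passing the stages in as callables keeps the sketch independent of any particular model-loading code, so each stage can be swapped for a stub when testing.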

💡 Tips

Speak clearly for best speech recognition results
Short, clear questions tend to work best
Give the model a moment to process complex inputs

🔄 Limitations

Processing speech may take a few seconds depending on server load
The model works best with English language inputs
Complex or very long conversations may occasionally lead to less coherent responses

Based on LLaMA-Omni by ICTNLP.

💡 Highlights

💪 Built on Llama-3.1-8B-Instruct, ensuring high-quality responses.
🚀 Low-latency speech interaction, with latency as low as 226 ms.
🎧 Simultaneous generation of both text and speech responses.

📋 Prerequisites

Python 3.10+
PyTorch 2.0+
CUDA-compatible GPU (for optimal performance)
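
Before installing, it can help to confirm these prerequisites from Python. The following is a minimal sketch, not part of this repository: `parse_version` and `check_prerequisites` are hypothetical helper names, and the check treats a missing PyTorch install as "not satisfied" rather than an error.

```python
import importlib.util
import re
import sys


def parse_version(version: str) -> tuple:
    """Extract (major, minor) from strings such as '2.1.0+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_prerequisites() -> dict:
    """Report whether Python 3.10+, PyTorch 2.0+, and a CUDA GPU are present."""
    report = {
        "python_ok": sys.version_info >= (3, 10),
        "torch_ok": False,
        "cuda_available": False,
    }
    # Only import torch if it is installed, so the check never crashes.
    if importlib.util.find_spec("torch") is None:
        return report
    import torch

    report["torch_ok"] = parse_version(torch.__version__) >= (2, 0)
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

A missing GPU is reported but not treated as fatal, since the list above describes it as needed for optimal performance rather than as a hard requirement.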

🛠️ Setup

Clone this repository:

git clone https://github.com/your-username/llama-omni.git
cd llama-omni

Create a virtual environment and install dependencies:

conda create -n llama-omni python=3.10
conda activate llama-omni
pip install -e .

Install fairseq:

pip install git+https://github.com/pytorch/fairseq.git

Install optional dependencies (skip this step on Apple Silicon Macs such as M1/M2):

# Only run this if not on Mac with Apple Silicon
pip install flash-attn
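
If the setup is scripted, the Apple Silicon exception can be detected rather than left to the reader. This is a sketch under assumptions: `is_apple_silicon` and `optional_packages` are hypothetical helpers, and the detection relies on `platform.machine()` reporting `arm64`, which may not hold when Python runs under Rosetta.

```python
import platform
from typing import List, Optional


def is_apple_silicon(system: Optional[str] = None, machine: Optional[str] = None) -> bool:
    """True on macOS with an ARM CPU; arguments allow overriding for tests."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "arm64"


def optional_packages(system: Optional[str] = None, machine: Optional[str] = None) -> List[str]:
    """Return the optional pip packages to install on this platform."""
    # flash-attn is skipped on Apple Silicon, matching the note above.
    return [] if is_apple_silicon(system, machine) else ["flash-attn"]


if __name__ == "__main__":
    packages = optional_packages()
    print("pip install " + " ".join(packages) if packages else "nothing optional to install")
```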

🐳 Docker Deployment

Docker support is provided so you can deploy without managing dependencies yourself:

Make sure Docker and Docker Compose are installed on your system

Build and run the container:

# Using the provided shell script
./run_docker.sh

# Or manually with docker-compose
docker-compose up --build

Access the application at http://localhost:7860

The Docker container will automatically:

Install all required dependencies
Download the necessary model files
Start the application
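
Because the container downloads model files before the application starts, the URL above may not respond right away. The sketch below polls it until it answers. `wait_for_app` is a hypothetical helper, the default timeout and interval are arbitrary, and treating an HTTP 200 on the root path as "ready" is an assumption about how the app behaves.

```python
import time
import urllib.error
import urllib.request


def wait_for_app(url: str = "http://localhost:7860",
                 timeout: float = 600.0,
                 interval: float = 5.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused or an HTTP error: the app is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    print("ready" if wait_for_app() else "timed out waiting for the app")
```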

GPU Support

The Docker setup includes NVIDIA GPU support. Make sure you have:

NVIDIA drivers installed on your host
NVIDIA Container Toolkit installed (for GPU passthrough)
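
A quick way to check both items is to look for the command-line tools they install. This is a heuristic sketch, not a guarantee that GPU passthrough works: `gpu_host_report` is a hypothetical helper, and it assumes `nvidia-smi` (shipped with the driver) and `nvidia-ctk` (shipped with the NVIDIA Container Toolkit) are on the `PATH` when installed.

```python
import shutil
from typing import Callable, Dict, Optional


def gpu_host_report(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Report whether the NVIDIA driver and container toolkit CLIs are found."""
    return {
        # nvidia-smi is installed together with the NVIDIA driver.
        "nvidia_driver": which("nvidia-smi") is not None,
        # nvidia-ctk is the CLI of the NVIDIA Container Toolkit.
        "container_toolkit": which("nvidia-ctk") is not None,
    }


if __name__ == "__main__":
    for name, found in gpu_host_report().items():
        print(f"{name}: {'found' if found else 'missing'}")
```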

🚀 Hugging Face Spaces Deployment

To deploy on Hugging Face Spaces:

Create a new Space
Connect this GitHub repository
Set the environment requirements (Python 3.10)
Deploy!

The app will automatically:

Download the required models (Whisper, LLaMA-Omni, vocoder)
Start the controller
Start the model worker
Launch the web interface

🖥️ Local Usage

If you want to run the application locally without Docker:

python app.py

This will:

Start the controller
Start a model worker that loads LLaMA-Omni
Launch a web interface

You can then access the interface at: http://localhost:8000

📝 Example Usage

Speech-to-Speech

Select the "Speech Input" tab
Record or upload audio
Click "Submit"
Receive both text and speech responses

Text-to-Speech

Select the "Text Input" tab
Type your message
Click "Submit"
Receive both text and speech responses

📚 Development

To contribute to this project:

Fork the repository
Make your changes
Submit a pull request

📄 LICENSE

This code is released under the Apache-2.0 License. The model is intended for academic research purposes only and may NOT be used for commercial purposes.

Original work by Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng.