---
title: LLaMA-Omni
emoji: πŸ¦™πŸŽ§
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
---
# πŸ¦™πŸŽ§ LLaMA-Omni: Seamless Speech Interaction
This is a Hugging Face Spaces deployment of LLaMA-Omni, a speech-language model that can process both speech and text inputs and generate both text and speech responses.
## πŸš€ Features
- **Speech-to-Text**: Record your voice or upload audio to interact with the model
- **Text Input**: Type messages directly for text-based conversation
- **Text-to-Speech**: Hear the model's responses in natural-sounding speech
- **Seamless Experience**: Switch between voice and text interaction modes
## πŸ› οΈ Technology Stack
- **Base Model**: Llama-3.1-8B-Instruct fine-tuned for speech interaction
- **Speech Recognition**: OpenAI Whisper large-v3 for accurate transcription
- **Text-to-Speech**: Custom vocoder for natural speech generation
## πŸ“Š Usage
1. Click the "Setup Environment" button to initialize the model
2. Wait for setup to complete (downloading models may take a few minutes)
3. Click "Start LLaMA-Omni Services" to start the model
4. Choose either:
- **Speech Input**: Record or upload audio to speak to the model
- **Text Input**: Type your message directly
5. Press "Submit" to get a response
## 🧠 Technical Details
This model combines large language model capabilities with speech processing to create a natural multimodal interaction experience. The architecture integrates:
- Speech recognition using Whisper
- Text generation with a fine-tuned Llama 3.1 8B model
- Speech synthesis with a high-quality vocoder
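A rough sketch of how these components could fit together as a cascaded pipeline is shown below. It is illustrative only: the package and model names are assumptions, and the real LLaMA-Omni integrates the speech encoder, LLM, and vocoder more tightly than a simple chain of calls.

```python
# Illustrative cascaded pipeline (not the project's actual implementation).
import whisper                      # openai-whisper
from transformers import pipeline

# 1. Speech recognition: transcribe the user's audio with Whisper large-v3.
asr = whisper.load_model("large-v3")
transcript = asr.transcribe("user_query.wav")["text"]

# 2. Text generation: respond with a Llama-3.1-8B-Instruct-style model
#    (the Hub ID below is gated and used here only for illustration).
llm = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
reply = llm(transcript, max_new_tokens=256)[0]["generated_text"]

# 3. Speech synthesis: a vocoder turns the response into audio.
#    `vocoder.synthesize` is a placeholder for the project's vocoder interface.
# audio = vocoder.synthesize(reply)
print(reply)
```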
## πŸ’‘ Tips
- Speak clearly for best speech recognition results
- Short, clear questions tend to work best
- Give the model a moment to process complex inputs
## πŸ”„ Limitations
- Processing speech may take a few seconds depending on server load
- The model works best with English language inputs
- Complex or very long conversations may occasionally lead to less coherent responses
---
Based on [LLaMA-Omni](https://github.com/ICTNLP/LLaMA-Omni) by ICTNLP.
## πŸ’‘ Highlights
* πŸ’ͺ **Built on Llama-3.1-8B-Instruct, ensuring high-quality responses.**
* πŸš€ **Low-latency speech interaction, with latency as low as 226 ms.**
* 🎧 **Simultaneous generation of both text and speech responses.**
## πŸ“‹ Prerequisites
- Python 3.10+
- PyTorch 2.0+
- CUDA-compatible GPU (for optimal performance)
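A quick way to verify these requirements, assuming PyTorch is already installed:

```python
# Environment sanity check (illustrative)
import sys
import torch

assert sys.version_info >= (3, 10), "Python 3.10+ is required"
print("PyTorch:", torch.__version__)               # expect 2.0 or newer
print("CUDA available:", torch.cuda.is_available())
```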
## πŸ› οΈ Setup
1. Clone this repository:
```bash
git clone https://github.com/your-username/llama-omni.git
cd llama-omni
```
2. Create a virtual environment and install dependencies:
```bash
conda create -n llama-omni python=3.10
conda activate llama-omni
pip install -e .
```
3. Install fairseq:
```bash
pip install git+https://github.com/pytorch/fairseq.git
```
4. Install optional dependencies (if not on Mac M1/M2):
```bash
# Only run this if not on Mac with Apple Silicon
pip install flash-attn
```
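After these steps, a short import check (attribute names assumed) confirms the core dependencies are importable:

```python
# Post-install sanity check (illustrative; adjust if your setup differs)
import torch
import fairseq

print("torch:", torch.__version__)
print("fairseq:", fairseq.__version__)

# flash-attn is optional and CUDA-only; the import fails on Apple Silicon.
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (optional)")
```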
## 🐳 Docker Deployment
We provide Docker support for easy deployment without worrying about dependencies:
1. Make sure Docker and Docker Compose are installed on your system
2. Build and run the container:
```bash
# Using the provided shell script
./run_docker.sh
# Or manually with docker-compose
docker-compose up --build
```
3. Access the application at http://localhost:7860
The Docker container will automatically:
- Install all required dependencies
- Download the necessary model files
- Start the application
### GPU Support
The Docker setup includes NVIDIA GPU support. Make sure you have:
- NVIDIA drivers installed on your host
- NVIDIA Container Toolkit installed (for GPU passthrough)
## πŸš€ Hugging Face Spaces Deployment
To deploy on Hugging Face Spaces:
1. Create a new Space
2. Connect this GitHub repository
3. Set the environment requirements (Python 3.10)
4. Deploy!
The app will automatically:
- Download the required models (Whisper, LLaMA-Omni, vocoder)
- Start the controller
- Start the model worker
- Launch the web interface
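If you prefer to pre-download the weights (for example, to speed up the first launch), something like the following works; the Hugging Face repo ID shown is an assumption, so check the model card for the exact ID you intend to use:

```python
# Optional pre-download of model weights (repo ID is an assumption)
import whisper
from huggingface_hub import snapshot_download

whisper.load_model("large-v3")                 # caches Whisper large-v3 locally
snapshot_download("ICTNLP/Llama-3.1-8B-Omni")  # LLaMA-Omni weights (ID assumed)
```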
## πŸ–₯️ Local Usage
If you want to run the application locally without Docker:
```bash
python app.py
```
This will:
1. Start the controller
2. Start a model worker that loads LLaMA-Omni
3. Launch a web interface
You can then access the interface at: http://localhost:8000
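Once the interface is up, you can also inspect it programmatically, assuming it is a Gradio app served at that address:

```python
# List the API endpoints exposed by the running web interface (illustrative)
from gradio_client import Client

client = Client("http://localhost:8000")
client.view_api()   # prints endpoint names and argument types
```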
## πŸ“ Example Usage
### Speech-to-Speech
1. Select the "Speech Input" tab
2. Record or upload audio
3. Click "Submit"
4. Receive both text and speech responses
### Text-to-Speech
1. Select the "Text Input" tab
2. Type your message
3. Click "Submit"
4. Receive both text and speech responses
## πŸ“š Development
To contribute to this project:
1. Fork the repository
2. Make your changes
3. Submit a pull request
## πŸ“„ LICENSE
This code is released under the Apache-2.0 License. The model is intended for academic research purposes only and may **NOT** be used for commercial purposes.
Original work by Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng.