---
title: LLaMA-Omni
emoji: 🦙🎧
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
---
# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction

This is a Hugging Face Spaces deployment of LLaMA-Omni, a speech-language model that can process both speech and text inputs and generate both text and speech responses.
## 🌟 Features

- **Speech-to-Text**: Record your voice or upload audio to interact with the model
- **Text Input**: Type messages directly for text-based conversation
- **Text-to-Speech**: Hear the model's responses in natural-sounding speech
- **Seamless Experience**: Switch between voice and text interaction modes
## 🛠️ Technology Stack

- **Base Model**: Llama-3.1-8B-Instruct fine-tuned for speech interaction
- **Speech Recognition**: OpenAI Whisper large-v3 for accurate transcription
- **Text-to-Speech**: Custom vocoder for natural speech generation
## 🚀 Usage

1. Click the "Setup Environment" button to initialize the model
2. Wait for setup to complete (downloading models may take a few minutes)
3. Click "Start LLaMA-Omni Services" to start the model
4. Choose either:
   - **Speech Input**: Record or upload audio to speak to the model
   - **Text Input**: Type your message directly
5. Press "Submit" to get a response
## 🔧 Technical Details

This model combines large language model capabilities with speech processing to create a natural multimodal interaction experience. The architecture integrates:

- Speech recognition using Whisper
- Text generation with a fine-tuned Llama 3.1 8B model
- Speech synthesis with a high-quality vocoder
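
For intuition, here is a minimal, cascaded sketch of the data flow those components describe (speech → text → reply → speech). It is illustrative only: LLaMA-Omni integrates these stages in a single model rather than chaining separate ones, the model IDs shown are stand-ins for the actual fine-tuned checkpoint, and the vocoder stage is omitted.

```python
# Illustrative cascade only -- the real model fuses these stages into one architecture.
import whisper                                    # pip install openai-whisper
from transformers import AutoModelForCausalLM, AutoTokenizer

ASR_MODEL = whisper.load_model("large-v3")        # speech recognition
LLM_ID = "meta-llama/Llama-3.1-8B-Instruct"       # stand-in for the LLaMA-Omni checkpoint
tokenizer = AutoTokenizer.from_pretrained(LLM_ID)
llm = AutoModelForCausalLM.from_pretrained(LLM_ID)

def respond(audio_path: str) -> str:
    """Transcribe the user's speech, then generate a text reply."""
    user_text = ASR_MODEL.transcribe(audio_path)["text"]
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_text}],
        add_generation_prompt=True,
        return_tensors="pt",
    )
    output_ids = llm.generate(input_ids, max_new_tokens=256)
    reply = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
    # A vocoder would convert `reply` (or generated speech units) back into audio here.
    return reply
```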
## 💡 Tips

- Speak clearly for best speech recognition results
- Short, clear questions tend to work best
- Give the model a moment to process complex inputs
## 📝 Limitations

- Processing speech may take a few seconds depending on server load
- The model works best with English-language inputs
- Complex or very long conversations may occasionally lead to less coherent responses
---

Based on [LLaMA-Omni](https://github.com/ICTNLP/LLaMA-Omni) by ICTNLP.
## 💡 Highlights

* 💪 **Built on Llama-3.1-8B-Instruct, ensuring high-quality responses.**
* 🚀 **Low-latency speech interaction with a latency as low as 226ms.**
* 🎧 **Simultaneous generation of both text and speech responses.**
## 📋 Prerequisites

- Python 3.10+
- PyTorch 2.0+
- CUDA-compatible GPU (for optimal performance)
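
If you are unsure whether your environment meets these requirements, a quick check like the one below (plain Python and PyTorch, nothing specific to this repo) can help:

```python
# Quick sanity check for the prerequisites listed above.
import sys
import torch

assert sys.version_info >= (3, 10), "Python 3.10+ is required"
major, minor = (int(x) for x in torch.__version__.split(".")[:2])
assert (major, minor) >= (2, 0), "PyTorch 2.0+ is required"
print("CUDA available:", torch.cuda.is_available())  # True is strongly recommended
```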
## 🛠️ Setup

1. Clone this repository:
   ```bash
   git clone https://github.com/your-username/llama-omni.git
   cd llama-omni
   ```
2. Create a virtual environment and install dependencies:
   ```bash
   conda create -n llama-omni python=3.10
   conda activate llama-omni
   pip install -e .
   ```
3. Install fairseq:
   ```bash
   pip install git+https://github.com/pytorch/fairseq.git
   ```
4. Install optional dependencies (if not on Mac M1/M2):
   ```bash
   # Only run this if not on a Mac with Apple Silicon
   pip install flash-attn
   ```
## 🐳 Docker Deployment

We provide Docker support for easy deployment without worrying about dependencies:

1. Make sure Docker and Docker Compose are installed on your system
2. Build and run the container:
   ```bash
   # Using the provided shell script
   ./run_docker.sh

   # Or manually with docker-compose
   docker-compose up --build
   ```
3. Access the application at http://localhost:7860

The Docker container will automatically:

- Install all required dependencies
- Download the necessary model files
- Start the application

### GPU Support

The Docker setup includes NVIDIA GPU support. Make sure you have:

- NVIDIA drivers installed on your host
- NVIDIA Container Toolkit installed (for GPU passthrough)
## 🚀 Hugging Face Spaces Deployment

To deploy on Hugging Face Spaces:

1. Create a new Space
2. Connect this GitHub repository
3. Set the environment requirements (Python 3.10)
4. Deploy!

The app will automatically:

- Download the required models (Whisper, LLaMA-Omni, vocoder)
- Start the controller
- Start the model worker
- Launch the web interface
## 🖥️ Local Usage

If you want to run the application locally without Docker:

```bash
python app.py
```

This will:

1. Start the controller
2. Start a model worker that loads LLaMA-Omni
3. Launch a web interface

You can then access the interface at: http://localhost:8000
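
Under the hood, `app.py` orchestrates the same controller / model worker / web server trio used by the upstream LLaMA-Omni serving scripts. The sketch below shows the general shape of that orchestration; the module paths and flags are assumptions based on the upstream layout and are not verified against this repository, so treat it as a reading aid rather than a replacement for `app.py`.

```python
# Rough sketch of the orchestration app.py performs (module paths/flags are assumptions).
import subprocess, sys, time

controller = subprocess.Popen(
    [sys.executable, "-m", "omni_speech.serve.controller",
     "--host", "0.0.0.0", "--port", "10000"]
)
time.sleep(5)  # give the controller a moment to start accepting registrations
worker = subprocess.Popen(
    [sys.executable, "-m", "omni_speech.serve.model_worker",
     "--controller", "http://localhost:10000", "--model-path", "Llama-3.1-8B-Omni"]
)
web = subprocess.Popen(
    [sys.executable, "-m", "omni_speech.serve.gradio_web_server",
     "--controller", "http://localhost:10000", "--port", "8000"]
)
for proc in (controller, worker, web):
    proc.wait()
```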
## 📝 Example Usage

### Speech-to-Speech

1. Select the "Speech Input" tab
2. Record or upload audio
3. Click "Submit"
4. Receive both text and speech responses

### Text-to-Speech

1. Select the "Text Input" tab
2. Type your message
3. Click "Submit"
4. Receive both text and speech responses
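
Because the web interface is a Gradio app, you can also call it programmatically with `gradio_client` once the services are running. The endpoint name below is a placeholder (it depends on how `app.py` registers its functions); use `view_api()` to discover the real ones.

```python
# Programmatic access sketch; "/chat" is a hypothetical endpoint name.
from gradio_client import Client  # pip install gradio_client

client = Client("http://localhost:7860")  # or http://localhost:8000 when running app.py directly
client.view_api()                         # prints the endpoints this app actually exposes
reply = client.predict("What is the weather like today?", api_name="/chat")
print(reply)
```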
## 🔄 Development

To contribute to this project:

1. Fork the repository
2. Make your changes
3. Submit a pull request
## 📄 LICENSE

This code is released under the Apache-2.0 License. The model is intended for academic research purposes only and may **NOT** be used for commercial purposes.
Original work by Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng.