---
title: LLaMA-Omni
emoji: 🦙🎧
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
---

# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction

This is a Hugging Face Spaces deployment of LLaMA-Omni, a speech-language model that processes both speech and text inputs and generates both text and speech responses.

## 🚀 Features

- **Speech-to-Text**: Record your voice or upload audio to interact with the model
- **Text Input**: Type messages directly for text-based conversation
- **Text-to-Speech**: Hear the model's responses in natural-sounding speech
- **Seamless Experience**: Switch between voice and text interaction modes

## 🛠️ Technology Stack

- **Base Model**: Llama-3.1-8B-Instruct fine-tuned for speech interaction
- **Speech Recognition**: OpenAI Whisper large-v3 for accurate transcription
- **Text-to-Speech**: Custom vocoder for natural speech generation

## 📊 Usage

1. Click the "Setup Environment" button to initialize the model
2. Wait for setup to complete (downloading models may take a few minutes)
3. Click "Start LLaMA-Omni Services" to start the model
4. Choose either:
   - **Speech Input**: Record or upload audio to speak to the model
   - **Text Input**: Type your message directly
5. Press "Submit" to get a response

## 🧠 Technical Details

This model combines large language model capabilities with speech processing to create a natural multimodal interaction experience. The architecture integrates:

- Speech recognition using Whisper
- Text generation with a fine-tuned Llama 3.1 8B model
- Speech synthesis with a high-quality vocoder

## 💡 Tips

- Speak clearly for the best speech recognition results
- Short, clear questions tend to work best
- Give the model a moment to process complex inputs

## 🔄 Limitations

- Processing speech may take a few seconds depending on server load
- The model works best with English-language inputs
- Very long or complex conversations may occasionally produce less coherent responses

---

Developed based on [LLaMA-Omni](https://github.com/ICTNLP/LLaMA-Omni) by ICTNLP.

## 💡 Highlights

* 💪 **Built on Llama-3.1-8B-Instruct, ensuring high-quality responses.**
* 🚀 **Low-latency speech interaction, with latency as low as 226 ms.**
* 🎧 **Simultaneous generation of both text and speech responses.**

## 📋 Prerequisites

- Python 3.10+
- PyTorch 2.0+
- CUDA-compatible GPU (for optimal performance)

## 🛠️ Setup

1. Clone this repository:

   ```bash
   git clone https://github.com/your-username/llama-omni.git
   cd llama-omni
   ```

2. Create a virtual environment and install dependencies:

   ```bash
   conda create -n llama-omni python=3.10
   conda activate llama-omni
   pip install -e .
   ```

3. Install fairseq:

   ```bash
   pip install git+https://github.com/pytorch/fairseq.git
   ```

4. Install optional dependencies (skip on Macs with Apple Silicon):

   ```bash
   # Only run this if you are not on a Mac with Apple Silicon
   pip install flash-attn
   ```

## 🐳 Docker Deployment

We provide Docker support for easy deployment without worrying about dependencies:

1. Make sure Docker and Docker Compose are installed on your system
2. Build and run the container:

   ```bash
   # Using the provided shell script
   ./run_docker.sh

   # Or manually with docker-compose
   docker-compose up --build
   ```

3. Access the application at http://localhost:7860

The Docker container will automatically:

- Install all required dependencies
- Download the necessary model files
- Start the application

### GPU Support

The Docker setup includes NVIDIA GPU support. Make sure you have:

- NVIDIA drivers installed on your host
- NVIDIA Container Toolkit installed (for GPU passthrough)

## 🚀 Gradio Spaces Deployment

To deploy on Gradio Spaces:

1. Create a new Gradio Space
2. Connect this GitHub repository
3. Set the environment requirements (Python 3.10)
4. Deploy!

The app will automatically:

- Download the required models (Whisper, LLaMA-Omni, vocoder)
- Start the controller
- Start the model worker
- Launch the web interface

## 🖥️ Local Usage

If you want to run the application locally without Docker:

```bash
python app.py
```

This will:

1. Start the controller
2. Start a model worker that loads LLaMA-Omni
3. Launch a web interface

You can then access the interface at http://localhost:8000.
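For a concrete picture of what `python app.py` orchestrates, here is a minimal sketch of that three-step launch sequence. The module paths, ports, and flags are assumptions modeled on the upstream LLaMA-Omni repository and the list above, not verified commands; check `app.py` for the exact startup logic.

```python
# launch_sketch.py -- rough sketch of the controller/worker/web startup.
# Module paths, ports, and flags are ASSUMPTIONS based on the upstream
# LLaMA-Omni repo; consult app.py for the authoritative commands.
import subprocess
import sys
import time

CONTROLLER_PORT = 10000  # assumed controller port
WORKER_PORT = 40000      # assumed model worker port
WEB_PORT = 8000          # web interface port from this README

processes = []

def launch(module_args):
    """Start a python -m subprocess and remember it for cleanup."""
    proc = subprocess.Popen([sys.executable, "-m", *module_args])
    processes.append(proc)
    return proc

try:
    # 1. Controller: tracks registered model workers.
    launch(["omni_speech.serve.controller",
            "--host", "0.0.0.0", "--port", str(CONTROLLER_PORT)])
    time.sleep(5)  # crude wait; a real launcher would poll for readiness

    # 2. Model worker: loads LLaMA-Omni and registers with the controller.
    launch(["omni_speech.serve.model_worker",
            "--host", "0.0.0.0", "--port", str(WORKER_PORT),
            "--controller", f"http://localhost:{CONTROLLER_PORT}",
            "--worker", f"http://localhost:{WORKER_PORT}",
            "--model-path", "models/Llama-3.1-8B-Omni"])  # hypothetical path
    time.sleep(5)

    # 3. Web interface: Gradio front end that talks to the controller.
    launch(["omni_speech.serve.gradio_web_server",
            "--controller", f"http://localhost:{CONTROLLER_PORT}",
            "--port", str(WEB_PORT)])

    # Keep the launcher alive until the children exit.
    for proc in processes:
        proc.wait()
except KeyboardInterrupt:
    for proc in processes:
        proc.terminate()
```

The ordering matters: the controller must come up first so the model worker can register with it, and the web interface then discovers available models through the controller.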
## 📝 Example Usage

### Speech-to-Speech

1. Select the "Speech Input" tab
2. Record or upload audio
3. Click "Submit"
4. Receive both text and speech responses

### Text-to-Speech

1. Select the "Text Input" tab
2. Type your message
3. Click "Submit"
4. Receive both text and speech responses

## 📚 Development

To contribute to this project:

1. Fork the repository
2. Make your changes
3. Submit a pull request

## 📄 License

This code is released under the Apache-2.0 License. The model is intended for academic research purposes only and may **NOT** be used for commercial purposes.

Original work by Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng.
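Beyond the browser UI, a running Gradio app can usually also be driven from code. Below is a hypothetical client sketch using the `gradio_client` package against the local web interface; the endpoint name and argument order are assumptions, so call `client.view_api()` first to see what this app actually exposes.

```python
# client_sketch.py -- hypothetical programmatic client for the local web
# interface. The api_name and arguments are ASSUMPTIONS; run
# client.view_api() to discover the endpoints this app really exposes.
from gradio_client import Client, handle_file

client = Client("http://localhost:8000/")

# Inspect the real API surface before relying on any endpoint below.
client.view_api()

# Hypothetical speech-to-speech call: send a WAV file, get text + audio back.
result = client.predict(
    handle_file("question.wav"),  # local audio file to send
    api_name="/submit",           # assumed endpoint name
)
print(result)
```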