Spaces build status: Build error

Commit d9bc8de · Parents: 18dcc68, fdfddfd

Files changed:
- README.md: +49 -2
- pre-requirements.txt: +3 -0
- requirements_spaces.txt: +1 -1
README.md (CHANGED)

@@ -9,9 +9,56 @@ app_file: app_gradio_spaces.py
 pinned: false
 ---
 
-# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction
+# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction
 
-This is a
+This is a Hugging Face Spaces deployment of LLaMA-Omni, a speech-language model that can process both speech and text inputs and generate both text and speech responses.
+
+## Features
+
+- **Speech-to-Text**: Record your voice or upload audio to interact with the model
+- **Text Input**: Type messages directly for text-based conversation
+- **Text-to-Speech**: Hear the model's responses in natural-sounding speech
+- **Seamless Experience**: Switch between voice and text interaction modes
+
+## 🛠️ Technology Stack
+
+- **Base Model**: Llama-3.1-8B-Instruct fine-tuned for speech interaction
+- **Speech Recognition**: OpenAI Whisper large-v3 for accurate transcription
+- **Text-to-Speech**: Custom vocoder for natural speech generation
+
+## Usage
+
+1. Click the "Setup Environment" button to initialize the model
+2. Wait for setup to complete (downloading models may take a few minutes)
+3. Click "Start LLaMA-Omni Services" to start the model
+4. Choose either:
+   - **Speech Input**: Record or upload audio to speak to the model
+   - **Text Input**: Type your message directly
+5. Press "Submit" to get a response
+
+## 🔧 Technical Details
+
+This model combines large language model capabilities with speech processing to create a natural multimodal interaction experience. The architecture integrates:
+
+- Speech recognition using Whisper
+- Text generation with a fine-tuned Llama 3.1 8B model
+- Speech synthesis with a high-quality vocoder
+
+## 💡 Tips
+
+- Speak clearly for best speech recognition results
+- Short, clear questions tend to work best
+- Give the model a moment to process complex inputs
+
+## Limitations
+
+- Processing speech may take a few seconds depending on server load
+- The model works best with English language inputs
+- Complex or very long conversations may occasionally lead to less coherent responses
+
+---
+
+Developed based on [LLaMA-Omni](https://github.com/ICTNLP/LLaMA-Omni) by ICTNLP.
 
 ## 💡 Highlights
 
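The Usage section added above describes a two-button flow (environment setup, then service start) in front of paired speech/text inputs. As a rough illustration only, here is a minimal Gradio 3.x sketch of that flow; every function name and body below is a hypothetical placeholder, since the actual `app_gradio_spaces.py` is not part of this diff.

```python
# Minimal sketch of the UI flow the README describes, against the Gradio 3.x
# API (see the gradio<4.0.0 pin below). Placeholder logic only.
import gradio as gr

def setup_environment():
    # Placeholder: the real app would download model weights here.
    return "Setup complete"

def start_services():
    # Placeholder: the real app would launch the model and vocoder services here.
    return "Services running"

def respond(audio_path, text):
    # Placeholder: transcribe audio (if any), generate a reply, synthesize speech.
    return f"Echo: {text or audio_path}", None

with gr.Blocks() as demo:
    status = gr.Textbox(label="Status", interactive=False)
    gr.Button("Setup Environment").click(setup_environment, outputs=status)
    gr.Button("Start LLaMA-Omni Services").click(start_services, outputs=status)

    # Gradio 3.x uses the singular `source` parameter for microphone input.
    audio_in = gr.Audio(source="microphone", type="filepath", label="Speech Input")
    text_in = gr.Textbox(label="Text Input")
    text_out = gr.Textbox(label="Response")
    audio_out = gr.Audio(label="Spoken Response")
    gr.Button("Submit").click(respond, inputs=[audio_in, text_in],
                              outputs=[text_out, audio_out])

demo.launch()
```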
pre-requirements.txt (ADDED)

@@ -0,0 +1,3 @@
+pip<24.1
+omegaconf @ git+https://github.com/omry/omegaconf.git@fd9109cff74d05794e14562dcbdde442eb24635d
+hydra-core==1.0.7
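On Hugging Face Spaces, pre-requirements.txt is installed before requirements.txt, so these pins take effect before the main dependency resolution. The `pip<24.1` cap plus a pinned omegaconf commit is a common workaround for older omegaconf/hydra releases whose package metadata newer pip versions reject; reading that as the motivation here is an assumption, as the commit carries no message. A quick hypothetical post-install sanity check:

```python
# Hypothetical sanity check that the pinned config stack resolved together:
# hydra-core 1.0.7 requires an omegaconf from the 2.0.x line.
import hydra
import omegaconf

print("omegaconf:", omegaconf.__version__)  # expect a 2.0.x version
print("hydra:", hydra.__version__)          # expect 1.0.7
```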
requirements_spaces.txt (CHANGED)

@@ -8,7 +8,7 @@ torch>=2.0.0
 numpy>=1.24.0
 transformers>=4.34.0
 accelerate>=0.21.0
-gradio>=3.50.2
+gradio>=3.50.2,<4.0.0  # Stay below 4.0 to maintain compatibility
 fastapi>=0.104.0
 uvicorn>=0.23.2
 pydantic>=2.3.0
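The only change here tightens the gradio range. A plausible reason (an assumption; the inline comment says only "compatibility") is that Gradio 4.0 renamed several component parameters, so 3.x-era code breaks on upgrade. For example:

```python
import gradio as gr

# Gradio 3.x (what this pin keeps): singular `source` parameter.
mic = gr.Audio(source="microphone", type="filepath")

# Gradio 4.x renamed it to a plural list; the 3.x form raises a TypeError there:
# mic = gr.Audio(sources=["microphone"], type="filepath")
```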