|
--- |
|
title: voice-assistant |
|
app_file: gradio_app.py |
|
sdk: gradio |
|
sdk_version: 5.29.1 |
|
--- |
|
# Real-time Conversational AI Chatbot Backend |
|
|
|
This project implements a Python-based backend for a real-time conversational AI chatbot. It features Speech-to-Text (STT), Language Model (LLM) processing via Google's Gemini API, and streaming Text-to-Speech (TTS) capabilities, all orchestrated through a FastAPI web server with WebSocket support for interactive conversations. |
|
|
|
## Core Features |
|
|
|
- **Speech-to-Text (STT):** Utilizes OpenAI's Whisper model to transcribe the user's spoken audio into text.
|
- **Language Model (LLM):** Integrates with Google's Gemini API (e.g., `gemini-1.5-flash-latest`) for generating intelligent and contextual responses. |
|
- **Text-to-Speech (TTS) with Streaming:** Employs AI4Bharat's IndicParler-TTS model (via `parler-tts` library) with `ParlerTTSStreamer` to convert the LLM's text response into audible speech, streamed chunk by chunk for faster time-to-first-audio. |
|
- **Real-time Interaction:** A WebSocket endpoint (`/ws/conversation`) manages the live, bidirectional flow of audio and text data between the client and server. |
|
- **Component Testing:** Includes separate RESTful HTTP endpoints for testing the STT, LLM, and TTS components in isolation.
|
- **Basic Client Demo:** Provides a simple HTML/JavaScript client served at the root (`/`) for demonstrating the WebSocket conversation flow. |
|
|
|
## Technologies Used |
|
|
|
- **Backend Framework:** FastAPI |
|
- **ASR (STT):** OpenAI Whisper |
|
- **LLM:** Google Gemini API (via `google-generativeai` SDK) |
|
- **TTS:** AI4Bharat IndicParler-TTS (via `parler-tts` and `transformers`) |
|
- **Audio Processing:** `soundfile`, `librosa` |
|
- **Async & Concurrency:** `asyncio`, `threading` (used to feed `ParlerTTSStreamer` output into the async event loop; see the sketch after this list)
|
- **ML/DL:** PyTorch |
|
- **Web Server:** Uvicorn |
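
The `threading` entry exists because `ParlerTTSStreamer` is consumed by blocking iteration while `model.generate()` runs, so its chunks must be handed over to the async event loop. Below is a minimal sketch of one way `synthesize_speech_streaming()` could do this with only stdlib `asyncio` and `threading`; `tts_chunks_blocking` is a hypothetical placeholder for the real ParlerTTS generation loop, and the `cancellation_event` matches the one described in the WebSocket section below.

```python
import asyncio
import threading
from typing import AsyncIterator, Iterator, Optional

def tts_chunks_blocking(text: str) -> Iterator[bytes]:
    """Hypothetical stand-in for driving ParlerTTSStreamer while
    model.generate() runs in this thread; yields raw PCM chunks."""
    yield b"\x00\x00" * 1024  # placeholder: one silent 16-bit mono chunk

async def synthesize_speech_streaming(
    text: str, cancellation_event: threading.Event
) -> AsyncIterator[bytes]:
    """Bridge the blocking chunk iterator into an async generator."""
    loop = asyncio.get_running_loop()
    queue: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue()

    def worker() -> None:
        try:
            for chunk in tts_chunks_blocking(text):
                if cancellation_event.is_set():  # client gone: stop generating early
                    break
                loop.call_soon_threadsafe(queue.put_nowait, chunk)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, None)  # end-of-stream sentinel

    threading.Thread(target=worker, daemon=True).start()
    while (chunk := await queue.get()) is not None:
        yield chunk
```

The queue-plus-sentinel pattern keeps the event loop responsive while generation happens off-thread, which is what makes the chunk-by-chunk time-to-first-audio improvement possible.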
|
|
|
## Setup and Installation |
|
|
|
1. **Clone the Repository (if applicable)** |
|
|
|
   ```bash
   git clone <your-repo-url>
   cd <your-repo-name>
   ```
|
|
|
2. **Create a Python Virtual Environment** |
|
|
|
   - Using `venv`:

     ```bash
     python -m venv venv
     source venv/bin/activate  # On Windows: venv\Scripts\activate
     ```

   - Or using `conda`:

     ```bash
     conda create -n voicebot_env python=3.10  # or your preferred Python 3.9+
     conda activate voicebot_env
     ```
|
|
|
3. **Install Dependencies** |
|
|
|
   ```bash
   pip install -r requirements.txt
   ```
|
|
|
   Ensure `ffmpeg` is installed on your system, as Whisper requires it (e.g., `sudo apt update && sudo apt install ffmpeg` on Debian/Ubuntu).
|
|
|
4. **Set Environment Variables:** |
|
- **Gemini API Key:** Obtain an API key from [Google AI Studio](https://aistudio.google.com/). Set it as an environment variable: |
|
     ```bash
     export GEMINI_API_KEY="YOUR_ACTUAL_GEMINI_API_KEY"
     ```
|
(For Windows PowerShell: `$env:GEMINI_API_KEY="YOUR_ACTUAL_GEMINI_API_KEY"`) |
|
- **(Optional) Whisper Model Size:** |
|
     ```bash
     export WHISPER_MODEL_SIZE="base"  # e.g., tiny, base, small, medium, large
     ```
|
Defaults to "base" if not set. |
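
   For reference, a minimal sketch of how the server would typically consume these variables at startup, using the `google-generativeai` and `whisper` APIs (illustrative only; the project's actual startup code may differ):

   ```python
   import os

   import google.generativeai as genai
   import whisper

   # Fail fast if the required Gemini key is missing.
   genai.configure(api_key=os.environ["GEMINI_API_KEY"])
   gemini_model = genai.GenerativeModel("gemini-1.5-flash-latest")

   # Optional Whisper model size, defaulting to "base" as described above.
   whisper_model = whisper.load_model(os.environ.get("WHISPER_MODEL_SIZE", "base"))
   ```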
|
|
|
## API Endpoints

### HTTP RESTful Endpoints
|
|
|
These are standard FastAPI path operations for testing individual components: |
|
|
|
- **`POST /api/stt`**: Upload an audio file to get its transcription. |
|
- **`POST /api/llm`**: Send text in a JSON payload to get a response from Gemini. |
|
- **`POST /api/tts`**: Send text in a JSON payload to get synthesized audio (non-streaming for this HTTP endpoint; returns a base64-encoded WAV).
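
For quick smoke tests, something like the following should work. The exact request schemas are assumptions (a multipart `file` field for STT and a `text` field for the JSON payloads); check the source if they differ:

```python
import requests

BASE = "http://localhost:8000"

# STT: upload an audio file for transcription (multipart field name "file" is an assumption).
with open("sample.wav", "rb") as f:
    r = requests.post(f"{BASE}/api/stt", files={"file": ("sample.wav", f, "audio/wav")})
print(r.json())

# LLM: send text, get a Gemini response (the "text" field name is an assumption).
r = requests.post(f"{BASE}/api/llm", json={"text": "Hello, who are you?"})
print(r.json())

# TTS: send text, get a base64-encoded WAV back.
r = requests.post(f"{BASE}/api/tts", json={"text": "Hello from the voice assistant."})
print(r.json())
```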
|
|
|
### WebSocket Endpoint: `/ws/conversation` |
|
|
|
This is the primary endpoint for real-time, bidirectional conversational interaction: |
|
|
|
- `@app.websocket("/ws/conversation")` defines the WebSocket route. |
|
- **Connection Handling:** Accepts new WebSocket connections. |
|
- **Main Interaction Loop:** |
|
1. **Receive Audio:** Waits to receive audio data (bytes) from the client (`await websocket.receive_bytes()`). |
|
2. **STT:** Calls `transcribe_audio_bytes()` to get text from the user's audio. Sends `USER_TRANSCRIPT: <text>` back to the client. |
|
3. **LLM:** Calls `generate_gemini_response()` with the transcribed text. Sends `ASSISTANT_RESPONSE_TEXT: <text>` back to the client. |
|
4. **Streaming TTS:** |
|
- Sends a `TTS_STREAM_START: {<audio_params>}` message to the client, informing it about the sample rate, channels, and bit depth of the upcoming audio stream. |
|
- Iterates through the `synthesize_speech_streaming()` asynchronous generator. |
|
- For each `audio_chunk_bytes` yielded, it sends these raw audio bytes to the client using `await websocket.send_bytes()`. |
|
- If `websocket.send_bytes()` fails (e.g., client disconnected), the loop breaks, and the `cancellation_event` is set to signal the TTS thread. |
|
- After the stream is complete (or cancelled), it sends a `TTS_STREAM_END` message. |
|
- **Error Handling:** Includes `try...except WebSocketDisconnect` to handle client disconnections gracefully, plus a general exception handler for other errors.
|
- **Cleanup:** The `finally` block ensures the `cancellation_event` for TTS is set and attempts to close the WebSocket. |
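
Put together, the handler described above boils down to roughly the following shape. This is a condensed sketch of the flow, not a verbatim copy of the project's code: the three helpers are stubbed here (whether the real ones are async is an assumption), and the audio parameters are illustrative values.

```python
import json
import threading

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

# Stubs standing in for the real helpers described above.
async def transcribe_audio_bytes(audio: bytes) -> str:
    return "stub transcript"

async def generate_gemini_response(text: str) -> str:
    return f"stub reply to: {text}"

async def synthesize_speech_streaming(text: str, cancel: threading.Event):
    yield b"\x00\x00" * 1024  # one silent PCM chunk

@app.websocket("/ws/conversation")
async def conversation(websocket: WebSocket) -> None:
    await websocket.accept()
    cancellation_event = threading.Event()
    try:
        while True:
            # 1. Receive one user utterance as raw audio bytes.
            audio_bytes = await websocket.receive_bytes()

            # 2. STT: echo the transcript back to the client.
            user_text = await transcribe_audio_bytes(audio_bytes)
            await websocket.send_text(f"USER_TRANSCRIPT: {user_text}")

            # 3. LLM: send the assistant's text response.
            reply_text = await generate_gemini_response(user_text)
            await websocket.send_text(f"ASSISTANT_RESPONSE_TEXT: {reply_text}")

            # 4. Streaming TTS: announce the stream format, then send raw chunks.
            params = {"sample_rate": 44100, "channels": 1, "bit_depth": 16}  # illustrative
            await websocket.send_text(f"TTS_STREAM_START: {json.dumps(params)}")
            async for chunk in synthesize_speech_streaming(reply_text, cancellation_event):
                try:
                    await websocket.send_bytes(chunk)
                except Exception:
                    cancellation_event.set()  # client went away: stop the TTS thread
                    break
            await websocket.send_text("TTS_STREAM_END")
    except WebSocketDisconnect:
        pass  # client disconnected cleanly
    finally:
        cancellation_event.set()
        try:
            await websocket.close()
        except RuntimeError:
            pass  # socket already closed
```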
|
|
|
## How to Run |
|
|
|
1. Ensure all setup steps (environment, dependencies, API key) are complete. |
|
2. Execute the script: |
|
   ```bash
   python main.py
   ```
|
Or, for development with auto-reload: |
|
   ```bash
   uvicorn main:app --reload --host 0.0.0.0 --port 8000
   ```
|
3. The server will start, and you should see logs indicating that models are being loaded. |
|
|