---
title: voice-assistant
app_file: gradio_app.py
sdk: gradio
sdk_version: 5.29.1
---
# Real-time Conversational AI Chatbot Backend
This project implements a Python-based backend for a real-time conversational AI chatbot. It features Speech-to-Text (STT), Language Model (LLM) processing via Google's Gemini API, and streaming Text-to-Speech (TTS) capabilities, all orchestrated through a FastAPI web server with WebSocket support for interactive conversations.
## Core Features
- Speech-to-Text (STT): Utilizes OpenAI's Whisper model to transcribe the user's spoken audio into text.
- Language Model (LLM): Integrates with Google's Gemini API (e.g., `gemini-1.5-flash-latest`) for generating intelligent and contextual responses.
- Text-to-Speech (TTS) with Streaming: Employs AI4Bharat's IndicParler-TTS model (via the `parler-tts` library) with `ParlerTTSStreamer` to convert the LLM's text response into audible speech, streamed chunk by chunk for a faster time-to-first-audio (see the streaming sketch after this list).
- Real-time Interaction: A WebSocket endpoint (`/ws/conversation`) manages the live, bidirectional flow of audio and text data between the client and server.
- Component Testing: Includes individual HTTP RESTful endpoints for testing STT, LLM, and TTS functionalities separately.
- Basic Client Demo: Provides a simple HTML/JavaScript client served at the root (`/`) for demonstrating the WebSocket conversation flow.
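The chunked-TTS pattern can be sketched as follows. This is a minimal sketch modeled on the upstream `parler-tts` streaming example, not this project's actual code: the checkpoint ID, voice description, and `play_steps` value are illustrative assumptions, and the import path for `ParlerTTSStreamer` can vary by library version.

```python
# Sketch of chunked TTS streaming with ParlerTTSStreamer, adapted from the
# parler-tts streaming example. Checkpoint, description, and play_steps are
# assumptions, not taken from this project's source.
from threading import Thread

import torch
from parler_tts import ParlerTTSForConditionalGeneration, ParlerTTSStreamer
# NOTE: some versions/demos define ParlerTTSStreamer in a local streamer.py
from transformers import AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_id = "ai4bharat/indic-parler-tts"  # assumed checkpoint
model = ParlerTTSForConditionalGeneration.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Some checkpoints use a separate tokenizer for the voice description;
# check the model card for the exact loading recipe.

sampling_rate = model.audio_encoder.config.sampling_rate
frame_rate = model.audio_encoder.config.frame_rate

def stream_speech(text: str, description: str, play_steps_in_s: float = 0.5):
    # play_steps controls how much audio the streamer buffers per chunk,
    # which trades latency against chunk granularity.
    play_steps = int(frame_rate * play_steps_in_s)
    streamer = ParlerTTSStreamer(model, device=device, play_steps=play_steps)
    desc_ids = tokenizer(description, return_tensors="pt").to(device)
    prompt_ids = tokenizer(text, return_tensors="pt").to(device)
    # generate() blocks until done, so it runs in a background thread while
    # the caller drains audio chunks from the streamer.
    thread = Thread(
        target=model.generate,
        kwargs=dict(
            input_ids=desc_ids.input_ids,
            prompt_input_ids=prompt_ids.input_ids,
            streamer=streamer,
            do_sample=True,
        ),
    )
    thread.start()
    for audio_chunk in streamer:  # float32 audio arrays
        if audio_chunk.shape[0] == 0:  # empty chunk signals end of stream
            break
        yield sampling_rate, audio_chunk
```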
## Technologies Used
- Backend Framework: FastAPI
- ASR (STT): OpenAI Whisper
- LLM: Google Gemini API (via the `google-generativeai` SDK)
- TTS: AI4Bharat IndicParler-TTS (via `parler-tts` and `transformers`)
- Audio Processing: `soundfile`, `librosa`
- Async & Concurrency: `asyncio`, `threading` (for `ParlerTTSStreamer`)
- ML/DL: PyTorch
- Web Server: Uvicorn
## Setup and Installation
### Clone the Repository (if applicable)

```bash
git clone <your-repo-url>
cd <your-repo-name>
```
Create a Python Virtual Environment
- Using
venv
:python -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate
- Or using
conda
:conda create -n voicebot_env python=3.10 # Or your preferred Python 3.9+ conda activate voicebot_env
- Using
### Install Dependencies

```bash
pip install -r requirements.txt
```

Ensure you have `ffmpeg` installed on your system, as Whisper requires it (e.g., `sudo apt update && sudo apt install ffmpeg` on Debian/Ubuntu).

### Set Environment Variables
- Gemini API Key: Obtain an API key from Google AI Studio and set it as an environment variable:

  ```bash
  export GEMINI_API_KEY="YOUR_ACTUAL_GEMINI_API_KEY"
  ```

  (For Windows PowerShell: `$env:GEMINI_API_KEY="YOUR_ACTUAL_GEMINI_API_KEY"`)

- (Optional) Whisper Model Size: Defaults to `base` if not set (see the sketch after this list):

  ```bash
  export WHISPER_MODEL_SIZE="base"  # e.g., tiny, base, small, medium, large
  ```
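As a minimal sketch of how the backend would consume these variables, inferred from the defaults described above rather than copied from the project's source:

```python
# Illustrative only: reads the env vars documented above with the stated
# defaults. The actual app's configuration handling may differ.
import os

import google.generativeai as genai  # google-generativeai SDK
import whisper  # openai-whisper

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # required
whisper_model = whisper.load_model(os.environ.get("WHISPER_MODEL_SIZE", "base"))
```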
## HTTP RESTful Endpoints
These are standard FastAPI path operations for testing individual components (a usage sketch follows the list):

- `POST /api/stt`: Upload an audio file to get its transcription.
- `POST /api/llm`: Send text in a JSON payload to get a response from Gemini.
- `POST /api/tts`: Send text in a JSON payload to get synthesized audio (non-streaming for this HTTP endpoint; returns base64-encoded WAV).
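A quick smoke test for these endpoints might look like the following. The base URL and the form/JSON field names (`file`, `text`) are assumptions; the interactive docs at `/docs` show the actual request schemas.

```python
# Smoke-test the component endpoints. Field names ("file", "text") and the
# base URL are assumptions -- verify against the FastAPI docs at /docs.
import requests

BASE = "http://localhost:8000"

# STT: upload an audio file for transcription.
with open("sample.wav", "rb") as f:
    print(requests.post(f"{BASE}/api/stt", files={"file": f}).json())

# LLM: send text, receive Gemini's reply.
print(requests.post(f"{BASE}/api/llm", json={"text": "Hello there!"}).json())

# TTS: send text, receive base64-encoded WAV audio.
print(requests.post(f"{BASE}/api/tts", json={"text": "Hello there!"}).json())
```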
## WebSocket Endpoint: `/ws/conversation`

This is the primary endpoint for real-time, bidirectional conversational interaction (a minimal client sketch follows the flow description):
- `@app.websocket("/ws/conversation")` defines the WebSocket route.
- Connection Handling: Accepts new WebSocket connections.
- Main Interaction Loop:
  - Receive Audio: Waits to receive audio data (bytes) from the client (`await websocket.receive_bytes()`).
  - STT: Calls `transcribe_audio_bytes()` to get text from the user's audio. Sends `USER_TRANSCRIPT: <text>` back to the client.
  - LLM: Calls `generate_gemini_response()` with the transcribed text. Sends `ASSISTANT_RESPONSE_TEXT: <text>` back to the client.
  - Streaming TTS:
    - Sends a `TTS_STREAM_START: {<audio_params>}` message to the client, informing it of the sample rate, channels, and bit depth of the upcoming audio stream.
    - Iterates through the `synthesize_speech_streaming()` asynchronous generator.
    - For each `audio_chunk_bytes` yielded, sends the raw audio bytes to the client using `await websocket.send_bytes()`.
    - If `websocket.send_bytes()` fails (e.g., the client disconnected), the loop breaks and the `cancellation_event` is set to signal the TTS thread.
    - After the stream is complete (or cancelled), sends a `TTS_STREAM_END` message.
- Error Handling: Includes `try...except WebSocketDisconnect` to handle client disconnections gracefully, plus a general exception handler.
- Cleanup: The `finally` block ensures the `cancellation_event` for TTS is set and attempts to close the WebSocket.
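For reference, here is a minimal client sketch for this protocol using the third-party `websockets` library. It assumes the server runs locally on port 8000 and that one utterance is sent as a single binary message; the message prefixes follow the protocol strings listed above.

```python
# Minimal client sketch for the /ws/conversation protocol described above.
# Assumes a local server on port 8000 and an existing "input.wav".
import asyncio

import websockets  # pip install websockets

async def converse(audio_path: str = "input.wav") -> None:
    uri = "ws://localhost:8000/ws/conversation"
    async with websockets.connect(uri) as ws:
        # Send one utterance as raw bytes, matching receive_bytes() server-side.
        with open(audio_path, "rb") as f:
            await ws.send(f.read())

        audio_chunks = []
        while True:
            message = await ws.recv()
            if isinstance(message, bytes):
                # Raw audio chunk from the streaming TTS.
                audio_chunks.append(message)
            elif message.startswith("USER_TRANSCRIPT:"):
                print("You said:", message.split(":", 1)[1].strip())
            elif message.startswith("ASSISTANT_RESPONSE_TEXT:"):
                print("Assistant:", message.split(":", 1)[1].strip())
            elif message.startswith("TTS_STREAM_START:"):
                print("Audio params:", message.split(":", 1)[1].strip())
            elif message.startswith("TTS_STREAM_END"):
                print(f"Received {len(audio_chunks)} audio chunks.")
                break

asyncio.run(converse())
```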
## How to Run
- Ensure all setup steps (environment, dependencies, API key) are complete.
- Execute the script:

  ```bash
  python main.py
  ```

  Or, for development with auto-reload:

  ```bash
  uvicorn main:app --reload --host 0.0.0.0 --port 8000
  ```

- The server will start, and you should see logs indicating that the models are being loaded.