---
title: voice-assistant
app_file: gradio_app.py
sdk: gradio
sdk_version: 5.29.1
---

Real-time Conversational AI Chatbot Backend

This project implements a Python-based backend for a real-time conversational AI chatbot. It features Speech-to-Text (STT), Language Model (LLM) processing via Google's Gemini API, and streaming Text-to-Speech (TTS) capabilities, all orchestrated through a FastAPI web server with WebSocket support for interactive conversations.

Core Features

  • Speech-to-Text (STT): Utilizes OpenAI's Whisper model to transcribe the user's spoken audio into text.
  • Language Model (LLM): Integrates with Google's Gemini API (e.g., gemini-1.5-flash-latest) for generating intelligent and contextual responses.
  • Text-to-Speech (TTS) with Streaming: Employs AI4Bharat's IndicParler-TTS model (via the parler-tts library) with ParlerTTSStreamer to convert the LLM's text response into audible speech, streamed chunk by chunk for a faster time-to-first-audio (see the sketch after this list).
  • Real-time Interaction: A WebSocket endpoint (/ws/conversation) manages the live, bidirectional flow of audio and text data between the client and server.
  • Component Testing: Includes individual HTTP RESTful endpoints for testing STT, LLM, and TTS functionalities separately.
  • Basic Client Demo: Provides a simple HTML/JavaScript client served at the root (/) for demonstrating the WebSocket conversation flow.
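
The streaming TTS feature is the most involved of these. The sketch below shows the general ParlerTTSStreamer pattern: model.generate() runs in a background thread while the main thread consumes audio chunks as they are produced. The checkpoint id, voice description, and play_steps value are illustrative assumptions and may not match what this project's code actually uses.

```python
# Minimal sketch of chunked synthesis with ParlerTTSStreamer. Import paths and
# constructor arguments can vary between parler-tts versions.
from threading import Thread

import torch
from parler_tts import ParlerTTSForConditionalGeneration, ParlerTTSStreamer
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "ai4bharat/indic-parler-tts"  # assumed checkpoint

model = ParlerTTSForConditionalGeneration.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# This checkpoint ships a separate tokenizer for the voice description.
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

description = "A calm female speaker with a clear voice."  # assumed voice prompt
text = "Hello! How can I help you today?"                  # LLM response to speak

desc = description_tokenizer(description, return_tensors="pt").to(device)
prompt = tokenizer(text, return_tensors="pt").to(device)

# The streamer yields audio roughly every `play_steps` decoder steps.
streamer = ParlerTTSStreamer(model, device=device, play_steps=20)
sampling_rate = model.audio_encoder.config.sampling_rate

# generate() blocks until synthesis finishes, so it runs in a worker thread
# while the main thread consumes chunks as they become available.
Thread(
    target=model.generate,
    kwargs=dict(
        input_ids=desc.input_ids,
        attention_mask=desc.attention_mask,
        prompt_input_ids=prompt.input_ids,
        prompt_attention_mask=prompt.attention_mask,
        streamer=streamer,
    ),
).start()

for audio_chunk in streamer:       # float32 numpy arrays at `sampling_rate`
    if audio_chunk.shape[0] == 0:  # empty chunk signals the end of the stream
        break
    # ...convert to 16-bit PCM bytes and forward over the WebSocket here...
```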

Technologies Used

  • Backend Framework: FastAPI
  • ASR (STT): OpenAI Whisper
  • LLM: Google Gemini API (via google-generativeai SDK)
  • TTS: AI4Bharat IndicParler-TTS (via parler-tts and transformers)
  • Audio Processing: soundfile, librosa
  • Async & Concurrency: asyncio, threading (for ParlerTTSStreamer)
  • ML/DL: PyTorch
  • Web Server: Uvicorn

Setup and Installation

  1. Clone the Repository (if applicable)

    git clone <your-repo-url>
    cd <your-repo-name>
    
  2. Create a Python Virtual Environment

    • Using venv:
      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate
      
    • Or using conda:
      conda create -n voicebot_env python=3.10  # Or your preferred Python 3.9+
      conda activate voicebot_env
      
  3. Install Dependencies

    pip install -r requirements.txt
    

    Ensure you have ffmpeg installed on your system, as Whisper requires it. (e.g., sudo apt update && sudo apt install ffmpeg on Debian/Ubuntu)

  4. Set Environment Variables

    • Gemini API Key: Obtain an API key from Google AI Studio. Set it as an environment variable:
      export GEMINI_API_KEY="YOUR_ACTUAL_GEMINI_API_KEY"
      
      (For Windows PowerShell: $env:GEMINI_API_KEY="YOUR_ACTUAL_GEMINI_API_KEY")
    • (Optional) Whisper Model Size:
      export WHISPER_MODEL_SIZE="base" # (e.g., tiny, base, small, medium, large)
      
      Defaults to "base" if not set.

HTTP RESTful Endpoints

These are standard FastAPI path operations for testing individual components:

  • POST /api/stt: Upload an audio file to get its transcription.
  • POST /api/llm: Send text in a JSON payload to get a response from Gemini.
  • POST /api/tts: Send text in a JSON payload to get synthesized audio (non-streaming for this HTTP endpoint; returns a base64-encoded WAV).
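
Example calls against a locally running server (port 8000, as in "How to Run" below). The JSON field names, the upload form-field name, and the response keys are assumptions; FastAPI's auto-generated docs at /docs (if enabled) show the exact request and response schemas.

```python
import base64

import requests

BASE = "http://localhost:8000"

# STT: upload an audio file (the "file" form-field name is an assumption).
with open("sample.wav", "rb") as f:
    r = requests.post(f"{BASE}/api/stt", files={"file": ("sample.wav", f, "audio/wav")})
print(r.json())

# LLM: send text, get Gemini's reply (the "text" field name is an assumption).
r = requests.post(f"{BASE}/api/llm", json={"text": "Hello, who are you?"})
print(r.json())

# TTS: send text, decode the base64-encoded WAV from the response.
r = requests.post(f"{BASE}/api/tts", json={"text": "Namaste!"})
audio_b64 = r.json().get("audio_base64", "")  # response field name is an assumption
with open("reply.wav", "wb") as f:
    f.write(base64.b64decode(audio_b64))
```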

WebSocket Endpoint: /ws/conversation

This is the primary endpoint for real-time, bidirectional conversational interaction:

  • @app.websocket("/ws/conversation") defines the WebSocket route.
  • Connection Handling: Accepts new WebSocket connections.
  • Main Interaction Loop:
    1. Receive Audio: Waits to receive audio data (bytes) from the client (await websocket.receive_bytes()).
    2. STT: Calls transcribe_audio_bytes() to get text from the user's audio. Sends USER_TRANSCRIPT: <text> back to the client.
    3. LLM: Calls generate_gemini_response() with the transcribed text. Sends ASSISTANT_RESPONSE_TEXT: <text> back to the client.
    4. Streaming TTS:
      • Sends a TTS_STREAM_START: {<audio_params>} message to the client, informing it about the sample rate, channels, and bit depth of the upcoming audio stream.
      • Iterates through the synthesize_speech_streaming() asynchronous generator.
      • For each audio_chunk_bytes yielded, it sends these raw audio bytes to the client using await websocket.send_bytes().
      • If websocket.send_bytes() fails (e.g., client disconnected), the loop breaks, and the cancellation_event is set to signal the TTS thread.
      • After the stream is complete (or cancelled), it sends a TTS_STREAM_END message.
  • Error Handling: Includes try...except WebSocketDisconnect to handle client disconnections gracefully and a general exception handler.
  • Cleanup: The finally block ensures the cancellation_event for TTS is set and attempts to close the WebSocket.
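
A minimal Python client for this protocol could look like the sketch below. It uses the third-party websockets package (not part of this project's stack), relies on the message prefixes described above, and leaves audio playback/framing details as illustrative placeholders.

```python
import asyncio

import websockets  # pip install websockets (illustrative client dependency)


async def converse(audio_path: str, url: str = "ws://localhost:8000/ws/conversation"):
    async with websockets.connect(url) as ws:
        # 1. Send the user's recorded audio as a single binary message.
        with open(audio_path, "rb") as f:
            await ws.send(f.read())

        pcm_chunks = []
        while True:
            msg = await ws.recv()
            if isinstance(msg, bytes):
                # Raw audio chunk from the streaming TTS.
                pcm_chunks.append(msg)
            elif msg.startswith("USER_TRANSCRIPT:"):
                print("You said:", msg.removeprefix("USER_TRANSCRIPT:").strip())
            elif msg.startswith("ASSISTANT_RESPONSE_TEXT:"):
                print("Assistant:", msg.removeprefix("ASSISTANT_RESPONSE_TEXT:").strip())
            elif msg.startswith("TTS_STREAM_START:"):
                print("Audio params:", msg.removeprefix("TTS_STREAM_START:").strip())
            elif msg.startswith("TTS_STREAM_END"):
                break

        # pcm_chunks now holds the raw audio; wrap it in a WAV header using the
        # sample rate / channels / bit depth from TTS_STREAM_START before playback.
        return b"".join(pcm_chunks)


if __name__ == "__main__":
    asyncio.run(converse("question.wav"))
```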

How to Run

  1. Ensure all setup steps (environment, dependencies, API key) are complete.
  2. Execute the script:
    python main.py
    
    Or, for development with auto-reload:
    uvicorn main:app --reload --host 0.0.0.0 --port 8000
    
  3. The server will start, and you should see logs indicating that models are being loaded.