---
title: voice-assistant
app_file: gradio_app.py
sdk: gradio
sdk_version: 5.29.1
---
# Real-time Conversational AI Chatbot Backend

This project implements a Python-based backend for a real-time conversational AI chatbot. It features Speech-to-Text (STT), Large Language Model (LLM) processing via Google's Gemini API, and streaming Text-to-Speech (TTS), all orchestrated through a FastAPI web server with WebSocket support for interactive conversations.

## Core Features

- **Speech-to-Text (STT):** Uses OpenAI's Whisper model to transcribe the user's spoken audio into text.
- **Large Language Model (LLM):** Integrates with Google's Gemini API (e.g., `gemini-1.5-flash-latest`) to generate contextual responses.
- **Text-to-Speech (TTS) with Streaming:** Employs AI4Bharat's IndicParler-TTS model (via the `parler-tts` library) with `ParlerTTSStreamer` to convert the LLM's text response into audible speech, streamed chunk by chunk for faster time-to-first-audio.
- **Real-time Interaction:** A WebSocket endpoint (`/ws/conversation`) manages the live, bidirectional flow of audio and text data between the client and server.
- **Component Testing:** Includes individual HTTP RESTful endpoints for testing STT, LLM, and TTS functionalities separately.
- **Basic Client Demo:** Provides a simple HTML/JavaScript client served at the root (`/`) for demonstrating the WebSocket conversation flow.

## Technologies Used

- **Backend Framework:** FastAPI
- **ASR (STT):** OpenAI Whisper
- **LLM:** Google Gemini API (via `google-generativeai` SDK)
- **TTS:** AI4Bharat IndicParler-TTS (via `parler-tts` and `transformers`)
- **Audio Processing:** `soundfile`, `librosa`
- **Async & Concurrency:** `asyncio`, plus `threading` for `ParlerTTSStreamer` (see the sketch after this list)
- **ML/DL:** PyTorch
- **Web Server:** Uvicorn
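
The `threading` entry above exists because `ParlerTTSStreamer` follows the standard `transformers` streaming pattern: `model.generate()` blocks until synthesis finishes, so it runs in a worker thread while the caller drains audio chunks from the streamer. A minimal sketch of that pattern, assuming the public `ai4bharat/indic-parler-tts` checkpoint; the voice description, chunk size, and tokenizer handling are illustrative and may differ from this repository's code:

```python
from threading import Thread

import torch
from parler_tts import ParlerTTSForConditionalGeneration, ParlerTTSStreamer
from transformers import AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("ai4bharat/indic-parler-tts").to(device)
prompt_tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-parler-tts")
# IndicParler tokenizes the voice description with the text encoder's own tokenizer.
desc_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

# Emit roughly half a second of audio per chunk.
frame_rate = model.audio_encoder.config.frame_rate
streamer = ParlerTTSStreamer(model, device=device, play_steps=int(frame_rate * 0.5))

desc_ids = desc_tokenizer("A calm female voice, very clear audio.", return_tensors="pt").input_ids.to(device)
prompt_ids = prompt_tokenizer("Hello! How can I help you today?", return_tensors="pt").input_ids.to(device)

# generate() blocks, so run it in a background thread and consume chunks here.
thread = Thread(target=model.generate, kwargs=dict(
    input_ids=desc_ids, prompt_input_ids=prompt_ids, streamer=streamer,
))
thread.start()

for chunk in streamer:        # numpy float32 arrays at the model's sampling rate
    if chunk.shape[0] == 0:   # an empty chunk signals the end of generation
        break
    print(f"received {chunk.shape[0]} samples")
thread.join()
```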

## Setup and Installation

1.  **Clone the Repository (if applicable)**

    ```bash
    git clone <your-repo-url>
    cd <your-repo-name>
    ```

2.  **Create a Python Virtual Environment**

    - Using `venv`:
      ```bash
      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate
      ```
    - Or using `conda`:
      ```bash
      conda create -n voicebot_env python=3.10  # Or your preferred Python 3.9+
      conda activate voicebot_env
      ```

3.  **Install Dependencies**

    ```bash
    pip install -r requirements.txt
    ```

    Ensure you have `ffmpeg` installed on your system, as Whisper requires it.
    (e.g., `sudo apt update && sudo apt install ffmpeg` on Debian/Ubuntu)

4.  **Set Environment Variables:**
    - **Gemini API Key:** Obtain an API key from [Google AI Studio](https://aistudio.google.com/). Set it as an environment variable:
      ```bash
      export GEMINI_API_KEY="YOUR_ACTUAL_GEMINI_API_KEY"
      ```
      (For Windows PowerShell: `$env:GEMINI_API_KEY="YOUR_ACTUAL_GEMINI_API_KEY"`)
    - **(Optional) Whisper Model Size:**
      ```bash
      export WHISPER_MODEL_SIZE="base" # (e.g., tiny, base, small, medium, large)
      ```
      Defaults to "base" if not set.
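
On the server side, these variables are presumably read with plain `os.environ` lookups. A minimal sketch (the variable names match those above; the exact error handling in this repository may differ):

```python
import os

# Required: Gemini calls cannot be made without a key (see step 4 above).
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY")
if not GEMINI_API_KEY:
    raise RuntimeError("GEMINI_API_KEY is not set")

# Optional: falls back to the documented default of "base".
WHISPER_MODEL_SIZE = os.environ.get("WHISPER_MODEL_SIZE", "base")
```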

## API Endpoints

### HTTP RESTful Endpoints

These are standard FastAPI path operations for testing individual components:

- **`POST /api/stt`**: Upload an audio file to get its transcription.
- **`POST /api/llm`**: Send text in a JSON payload to get a response from Gemini.
- **`POST /api/tts`**: Send text in a JSON payload to get synthesized audio (non-streaming for this HTTP endpoint; returns a base64-encoded WAV).
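
Each of these can be exercised with a few lines of Python. A sketch, assuming the server from the How to Run section below (localhost:8000); the multipart field name, JSON keys, and response shapes are guesses, so consult the interactive docs at `/docs` for the actual schema:

```python
import base64
import requests

BASE = "http://localhost:8000"

# STT: upload an audio file (the "file" field name is an assumption).
with open("sample.wav", "rb") as f:
    print(requests.post(f"{BASE}/api/stt", files={"file": f}).json())

# LLM: send text in a JSON payload (the "text" key is an assumption).
print(requests.post(f"{BASE}/api/llm", json={"text": "Hello!"}).json())

# TTS: returns a base64-encoded WAV (the "audio_base64" key is an assumption).
resp = requests.post(f"{BASE}/api/tts", json={"text": "Hello!"}).json()
with open("reply.wav", "wb") as out:
    out.write(base64.b64decode(resp["audio_base64"]))
```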

### WebSocket Endpoint: `/ws/conversation`

This is the primary endpoint for real-time, bidirectional conversational interaction:

- `@app.websocket("/ws/conversation")` defines the WebSocket route.
- **Connection Handling:** Accepts new WebSocket connections.
- **Main Interaction Loop:**
  1.  **Receive Audio:** Waits to receive audio data (bytes) from the client (`await websocket.receive_bytes()`).
  2.  **STT:** Calls `transcribe_audio_bytes()` to get text from the user's audio. Sends `USER_TRANSCRIPT: <text>` back to the client.
  3.  **LLM:** Calls `generate_gemini_response()` with the transcribed text. Sends `ASSISTANT_RESPONSE_TEXT: <text>` back to the client.
  4.  **Streaming TTS:**
      - Sends a `TTS_STREAM_START: {<audio_params>}` message informing the client of the sample rate, channel count, and bit depth of the upcoming audio stream.
      - Iterates through the `synthesize_speech_streaming()` asynchronous generator.
      - For each `audio_chunk_bytes` yielded, it sends these raw audio bytes to the client using `await websocket.send_bytes()`.
      - If `websocket.send_bytes()` fails (e.g., client disconnected), the loop breaks, and the `cancellation_event` is set to signal the TTS thread.
      - After the stream is complete (or cancelled), it sends a `TTS_STREAM_END` message.
- **Error Handling:** A `try...except WebSocketDisconnect` block handles client disconnections gracefully, backed by a general exception handler.
- **Cleanup:** The `finally` block ensures the `cancellation_event` for TTS is set and attempts to close the WebSocket.
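
Putting the protocol together, a minimal Python client could look like the sketch below. It uses the third-party `websockets` package, assumes the default host and port from the How to Run section, and treats any text frame other than `TTS_STREAM_END` as informational:

```python
import asyncio
import websockets

async def converse(audio_path: str) -> bytes:
    """Send one utterance and collect the streamed TTS reply as raw PCM."""
    async with websockets.connect("ws://localhost:8000/ws/conversation") as ws:
        with open(audio_path, "rb") as f:
            await ws.send(f.read())          # step 1: raw audio bytes for the utterance

        pcm = bytearray()
        while True:
            msg = await ws.recv()
            if isinstance(msg, bytes):       # step 4: a raw TTS audio chunk
                pcm.extend(msg)
            elif msg.startswith("TTS_STREAM_END"):
                break                        # stream complete (or cancelled)
            else:                            # USER_TRANSCRIPT:, ASSISTANT_RESPONSE_TEXT:, TTS_STREAM_START:
                print(msg)
        # Interpret the buffer using the params announced in TTS_STREAM_START.
        return bytes(pcm)

if __name__ == "__main__":
    asyncio.run(converse("sample.wav"))
```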

## How to Run

1.  Ensure all setup steps (environment, dependencies, API key) are complete.
2.  Execute the script:
    ```bash
    python main.py
    ```
    Or, for development with auto-reload:
    ```bash
    uvicorn main:app --reload --host 0.0.0.0 --port 8000
    ```
3.  The server will start, and you should see logs indicating that models are being loaded.