Upload folder using huggingface_hub
Browse files- .gradio/certificate.pem +31 -0
- README.md +108 -8
- __pycache__/main.cpython-310.pyc +0 -0
- gradio_app.py +294 -0
- infereless.py +14 -0
- main.py +740 -0
- parler-streaming.py +402 -0
- requirements.txt +22 -0
- streaming_nb.ipynb +0 -0
- test_notebook.ipynb +0 -0
.gradio/certificate.pem
ADDED
@@ -0,0 +1,31 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
-----BEGIN CERTIFICATE-----
|
2 |
+
MIIFazCCA1OgAwIBAgIRAIIQz7DSQONZRGPgu2OCiwAwDQYJKoZIhvcNAQELBQAw
|
3 |
+
TzELMAkGA1UEBhMCVVMxKTAnBgNVBAoTIEludGVybmV0IFNlY3VyaXR5IFJlc2Vh
|
4 |
+
cmNoIEdyb3VwMRUwEwYDVQQDEwxJU1JHIFJvb3QgWDEwHhcNMTUwNjA0MTEwNDM4
|
5 |
+
WhcNMzUwNjA0MTEwNDM4WjBPMQswCQYDVQQGEwJVUzEpMCcGA1UEChMgSW50ZXJu
|
6 |
+
ZXQgU2VjdXJpdHkgUmVzZWFyY2ggR3JvdXAxFTATBgNVBAMTDElTUkcgUm9vdCBY
|
7 |
+
MTCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBAK3oJHP0FDfzm54rVygc
|
8 |
+
h77ct984kIxuPOZXoHj3dcKi/vVqbvYATyjb3miGbESTtrFj/RQSa78f0uoxmyF+
|
9 |
+
0TM8ukj13Xnfs7j/EvEhmkvBioZxaUpmZmyPfjxwv60pIgbz5MDmgK7iS4+3mX6U
|
10 |
+
A5/TR5d8mUgjU+g4rk8Kb4Mu0UlXjIB0ttov0DiNewNwIRt18jA8+o+u3dpjq+sW
|
11 |
+
T8KOEUt+zwvo/7V3LvSye0rgTBIlDHCNAymg4VMk7BPZ7hm/ELNKjD+Jo2FR3qyH
|
12 |
+
B5T0Y3HsLuJvW5iB4YlcNHlsdu87kGJ55tukmi8mxdAQ4Q7e2RCOFvu396j3x+UC
|
13 |
+
B5iPNgiV5+I3lg02dZ77DnKxHZu8A/lJBdiB3QW0KtZB6awBdpUKD9jf1b0SHzUv
|
14 |
+
KBds0pjBqAlkd25HN7rOrFleaJ1/ctaJxQZBKT5ZPt0m9STJEadao0xAH0ahmbWn
|
15 |
+
OlFuhjuefXKnEgV4We0+UXgVCwOPjdAvBbI+e0ocS3MFEvzG6uBQE3xDk3SzynTn
|
16 |
+
jh8BCNAw1FtxNrQHusEwMFxIt4I7mKZ9YIqioymCzLq9gwQbooMDQaHWBfEbwrbw
|
17 |
+
qHyGO0aoSCqI3Haadr8faqU9GY/rOPNk3sgrDQoo//fb4hVC1CLQJ13hef4Y53CI
|
18 |
+
rU7m2Ys6xt0nUW7/vGT1M0NPAgMBAAGjQjBAMA4GA1UdDwEB/wQEAwIBBjAPBgNV
|
19 |
+
HRMBAf8EBTADAQH/MB0GA1UdDgQWBBR5tFnme7bl5AFzgAiIyBpY9umbbjANBgkq
|
20 |
+
hkiG9w0BAQsFAAOCAgEAVR9YqbyyqFDQDLHYGmkgJykIrGF1XIpu+ILlaS/V9lZL
|
21 |
+
ubhzEFnTIZd+50xx+7LSYK05qAvqFyFWhfFQDlnrzuBZ6brJFe+GnY+EgPbk6ZGQ
|
22 |
+
3BebYhtF8GaV0nxvwuo77x/Py9auJ/GpsMiu/X1+mvoiBOv/2X/qkSsisRcOj/KK
|
23 |
+
NFtY2PwByVS5uCbMiogziUwthDyC3+6WVwW6LLv3xLfHTjuCvjHIInNzktHCgKQ5
|
24 |
+
ORAzI4JMPJ+GslWYHb4phowim57iaztXOoJwTdwJx4nLCgdNbOhdjsnvzqvHu7Ur
|
25 |
+
TkXWStAmzOVyyghqpZXjFaH3pO3JLF+l+/+sKAIuvtd7u+Nxe5AW0wdeRlN8NwdC
|
26 |
+
jNPElpzVmbUq4JUagEiuTDkHzsxHpFKVK7q4+63SM1N95R1NbdWhscdCb+ZAJzVc
|
27 |
+
oyi3B43njTOQ5yOf+1CceWxG1bQVs5ZufpsMljq4Ui0/1lvh+wjChP4kqKOJ2qxq
|
28 |
+
4RgqsahDYVvTH9w7jXbyLeiNdd8XM2w9U/t7y0Ff/9yi0GE44Za4rF2LN9d11TPA
|
29 |
+
mRGunUHBcnWEvgJBQl9nJEiU0Zsnvgc/ubhPgXRR4Xq37Z0j4r7g1SgEEzwxA57d
|
30 |
+
emyPxgcYxn/eR44/KJ4EBs+lVDR3veyJm+kXQ99b21/+jh5Xos1AnX5iItreGCc=
|
31 |
+
-----END CERTIFICATE-----
|
README.md
CHANGED
@@ -1,12 +1,112 @@
|
|
1 |
---
|
2 |
-
title:
|
3 |
-
|
4 |
-
colorFrom: indigo
|
5 |
-
colorTo: yellow
|
6 |
sdk: gradio
|
7 |
-
sdk_version: 5.
|
8 |
-
app_file: app.py
|
9 |
-
pinned: false
|
10 |
---
|
|
|
11 |
|
12 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
---
|
2 |
+
title: voice-assistant
|
3 |
+
app_file: gradio_app.py
|
|
|
|
|
4 |
sdk: gradio
|
5 |
+
sdk_version: 5.29.1
|
|
|
|
|
6 |
---
|
7 |
+
# Real-time Conversational AI Chatbot Backend
|
8 |
|
9 |
+
This project implements a Python-based backend for a real-time conversational AI chatbot. It features Speech-to-Text (STT), Language Model (LLM) processing via Google's Gemini API, and streaming Text-to-Speech (TTS) capabilities, all orchestrated through a FastAPI web server with WebSocket support for interactive conversations.
|
10 |
+
|
11 |
+
## Core Features
|
12 |
+
|
13 |
+
- **Speech-to-Text (STT):** Utilizes OpenAI's Whisper model to transcribe user's spoken audio into text.
|
14 |
+
- **Language Model (LLM):** Integrates with Google's Gemini API (e.g., `gemini-1.5-flash-latest`) for generating intelligent and contextual responses.
|
15 |
+
- **Text-to-Speech (TTS) with Streaming:** Employs AI4Bharat's IndicParler-TTS model (via `parler-tts` library) with `ParlerTTSStreamer` to convert the LLM's text response into audible speech, streamed chunk by chunk for faster time-to-first-audio.
|
16 |
+
- **Real-time Interaction:** A WebSocket endpoint (`/ws/conversation`) manages the live, bidirectional flow of audio and text data between the client and server.
|
17 |
+
- **Component Testing:** Includes individual HTTP RESTful endpoints for testing STT, LLM, and TTS functionalities separately.
|
18 |
+
- **Basic Client Demo:** Provides a simple HTML/JavaScript client served at the root (`/`) for demonstrating the WebSocket conversation flow.
|
19 |
+
|
20 |
+
## Technologies Used
|
21 |
+
|
22 |
+
- **Backend Framework:** FastAPI
|
23 |
+
- **ASR (STT):** OpenAI Whisper
|
24 |
+
- **LLM:** Google Gemini API (via `google-generativeai` SDK)
|
25 |
+
- **TTS:** AI4Bharat IndicParler-TTS (via `parler-tts` and `transformers`)
|
26 |
+
- **Audio Processing:** `soundfile`, `librosa`
|
27 |
+
- **Async & Concurrency:** `asyncio`, `threading` (for ParlerTTSStreamer)
|
28 |
+
- **ML/DL:** PyTorch
|
29 |
+
- **Web Server:** Uvicorn
|
30 |
+
|
31 |
+
## Setup and Installation
|
32 |
+
|
33 |
+
1. **Clone the Repository (if applicable)**
|
34 |
+
|
35 |
+
```bash
|
36 |
+
git clone <your-repo-url>
|
37 |
+
cd <your-repo-name>
|
38 |
+
```
|
39 |
+
|
40 |
+
2. **Create a Python Virtual Environment**
|
41 |
+
|
42 |
+
- Using `venv`:
|
43 |
+
```bash
|
44 |
+
python -m venv venv
|
45 |
+
source venv/bin/activate # On Windows: venv\Scripts\activate
|
46 |
+
```
|
47 |
+
- Or using `conda`:
|
48 |
+
```bash
|
49 |
+
conda create -n voicebot_env python=3.10 # Or your preferred Python 3.9+
|
50 |
+
conda activate voicebot_env
|
51 |
+
```
|
52 |
+
|
53 |
+
3. **Install Dependencies**
|
54 |
+
|
55 |
+
```bash
|
56 |
+
pip install -r requirements.txt
|
57 |
+
```
|
58 |
+
|
59 |
+
Ensure you have `ffmpeg` installed on your system, as Whisper requires it.
|
60 |
+
(e.g., `sudo apt update && sudo apt install ffmpeg` on Debian/Ubuntu)
|
61 |
+
|
62 |
+
4. **Set Environment Variables:**
|
63 |
+
- **Gemini API Key:** Obtain an API key from [Google AI Studio](https://aistudio.google.com/). Set it as an environment variable:
|
64 |
+
```bash
|
65 |
+
export GEMINI_API_KEY="YOUR_ACTUAL_GEMINI_API_KEY"
|
66 |
+
```
|
67 |
+
(For Windows PowerShell: `$env:GEMINI_API_KEY="YOUR_ACTUAL_GEMINI_API_KEY"`)
|
68 |
+
- **(Optional) Whisper Model Size:**
|
69 |
+
```bash
|
70 |
+
export WHISPER_MODEL_SIZE="base" # (e.g., tiny, base, small, medium, large)
|
71 |
+
```
|
72 |
+
Defaults to "base" if not set.
|
73 |
+
|
74 |
+
### HTTP RESTful Endpoints
|
75 |
+
|
76 |
+
These are standard FastAPI path operations for testing individual components:
|
77 |
+
|
78 |
+
- **`POST /api/stt`**: Upload an audio file to get its transcription.
|
79 |
+
- **`POST /api/llm`**: Send text in a JSON payload to get a response from Gemini.
|
80 |
+
- **`POST /api/tts`**: Send text in a JSON payload to get synthesized audio (non-streaming for this HTTP endpoint, returns base64 encoded WAV).
|
81 |
+
|
82 |
+
### WebSocket Endpoint: `/ws/conversation`
|
83 |
+
|
84 |
+
This is the primary endpoint for real-time, bidirectional conversational interaction:
|
85 |
+
|
86 |
+
- `@app.websocket("/ws/conversation")` defines the WebSocket route.
|
87 |
+
- **Connection Handling:** Accepts new WebSocket connections.
|
88 |
+
- **Main Interaction Loop:**
|
89 |
+
1. **Receive Audio:** Waits to receive audio data (bytes) from the client (`await websocket.receive_bytes()`).
|
90 |
+
2. **STT:** Calls `transcribe_audio_bytes()` to get text from the user's audio. Sends `USER_TRANSCRIPT: <text>` back to the client.
|
91 |
+
3. **LLM:** Calls `generate_gemini_response()` with the transcribed text. Sends `ASSISTANT_RESPONSE_TEXT: <text>` back to the client.
|
92 |
+
4. **Streaming TTS:**
|
93 |
+
- Sends a `TTS_STREAM_START: {<audio_params>}` message to the client, informing it about the sample rate, channels, and bit depth of the upcoming audio stream.
|
94 |
+
- Iterates through the `synthesize_speech_streaming()` asynchronous generator.
|
95 |
+
- For each `audio_chunk_bytes` yielded, it sends these raw audio bytes to the client using `await websocket.send_bytes()`.
|
96 |
+
- If `websocket.send_bytes()` fails (e.g., client disconnected), the loop breaks, and the `cancellation_event` is set to signal the TTS thread.
|
97 |
+
- After the stream is complete (or cancelled), it sends a `TTS_STREAM_END` message.
|
98 |
+
- **Error Handling:** Includes `try...except WebSocketDisconnect` to handle client disconnections gracefully and a general exception handler.
|
99 |
+
- **Cleanup:** The `finally` block ensures the `cancellation_event` for TTS is set and attempts to close the WebSocket.
|
100 |
+
|
101 |
+
## How to Run
|
102 |
+
|
103 |
+
1. Ensure all setup steps (environment, dependencies, API key) are complete.
|
104 |
+
2. Execute the script:
|
105 |
+
```bash
|
106 |
+
python main.py
|
107 |
+
```
|
108 |
+
Or, for development with auto-reload:
|
109 |
+
```bash
|
110 |
+
uvicorn main:app --reload --host 0.0.0.0 --port 8000
|
111 |
+
```
|
112 |
+
3. The server will start, and you should see logs indicating that models are being loaded.
|
__pycache__/main.cpython-310.pyc
ADDED
Binary file (31.2 kB). View file
|
|
gradio_app.py
ADDED
@@ -0,0 +1,294 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# gradio_app.py
|
2 |
+
import gradio as gr
|
3 |
+
import io
|
4 |
+
import os
|
5 |
+
import torch
|
6 |
+
from parler_tts import ParlerTTSForConditionalGeneration
|
7 |
+
from transformers import AutoTokenizer, AutoModel # CHANGED: Using AutoModel as per model card
|
8 |
+
import numpy as np
|
9 |
+
import google.generativeai as genai
|
10 |
+
import asyncio
|
11 |
+
import librosa
|
12 |
+
import torchaudio # Often used by models like this for audio loading/processing internally or as input type
|
13 |
+
|
14 |
+
# --- Configuration ---
|
15 |
+
ASR_MODEL_NAME = "ai4bharat/indic-conformer-600m-multilingual"
|
16 |
+
TARGET_SAMPLE_RATE = 16000 # Model expects 16kHz
|
17 |
+
|
18 |
+
TTS_MODEL_NAME = "ai4bharat/indic-parler-tts"
|
19 |
+
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY", "AIzaSyD6x3Yoby4eQ6QL2kaaG_Rz3fG3rh7wPB8")
|
20 |
+
GEMINI_MODEL_NAME_GRADIO = "gemini-1.5-flash-latest"
|
21 |
+
|
22 |
+
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
|
23 |
+
# torch_dtype for ParlerTTS, Gemini etc. For ASR model, it might handle its own precision.
|
24 |
+
|
25 |
+
# --- Global Model Variables ---
|
26 |
+
asr_model_gradio = None # This will be the AutoModel instance
|
27 |
+
|
28 |
+
gemini_model_instance_gradio = None
|
29 |
+
tts_model_gradio = None
|
30 |
+
tts_tokenizer_gradio = None # For ParlerTTS
|
31 |
+
|
32 |
+
# --- Model Loading & API Configuration ---
|
33 |
+
def load_all_resources_gradio():
|
34 |
+
global asr_model_gradio, tts_model_gradio, tts_tokenizer_gradio, gemini_model_instance_gradio
|
35 |
+
print(f"Gradio: Loading resources. ASR will be on device: {DEVICE}")
|
36 |
+
|
37 |
+
if asr_model_gradio is None:
|
38 |
+
print(f"Gradio: Loading ASR model: {ASR_MODEL_NAME} using AutoModel")
|
39 |
+
try:
|
40 |
+
# Load using AutoModel as per the model card's implication
|
41 |
+
asr_model_gradio = AutoModel.from_pretrained(ASR_MODEL_NAME, trust_remote_code=True)
|
42 |
+
asr_model_gradio.to(DEVICE) # Move model to device
|
43 |
+
# The model might handle its own precision (e.g. .half()) internally if `trust_remote_code` allows
|
44 |
+
# Or you might need to call asr_model_gradio.half() if it supports it and you're on CUDA.
|
45 |
+
if DEVICE == "cuda" and hasattr(asr_model_gradio, 'half'):
|
46 |
+
print("Gradio: Applying .half() to ASR model.")
|
47 |
+
asr_model_gradio.half()
|
48 |
+
asr_model_gradio.eval()
|
49 |
+
print(f"Gradio: ASR model ({ASR_MODEL_NAME}) loaded using AutoModel.")
|
50 |
+
except Exception as e:
|
51 |
+
print(f"Gradio: Failed to load ASR model {ASR_MODEL_NAME} using AutoModel: {e}")
|
52 |
+
import traceback
|
53 |
+
traceback.print_exc()
|
54 |
+
asr_model_gradio = None
|
55 |
+
|
56 |
+
if tts_model_gradio is None: # ParlerTTS loading
|
57 |
+
print(f"Gradio: Loading IndicParler-TTS model: {TTS_MODEL_NAME}")
|
58 |
+
# Ensure ParlerTTS specific tokenizer is loaded for TTS
|
59 |
+
# Note: ASR model might have its own internal tokenizer/processor handled by its custom code
|
60 |
+
tts_parler_tokenizer = AutoTokenizer.from_pretrained(TTS_MODEL_NAME, trust_remote_code=True)
|
61 |
+
tts_model_gradio = ParlerTTSForConditionalGeneration.from_pretrained(TTS_MODEL_NAME, trust_remote_code=True).to(DEVICE)
|
62 |
+
tts_tokenizer_gradio = tts_parler_tokenizer
|
63 |
+
print("Gradio: IndicParler-TTS model loaded.")
|
64 |
+
|
65 |
+
if gemini_model_instance_gradio is None: # Gemini loading
|
66 |
+
if not GEMINI_API_KEY:
|
67 |
+
print("Gradio: GEMINI_API_KEY not found. LLM functionality via Gemini will be limited.")
|
68 |
+
else:
|
69 |
+
try:
|
70 |
+
genai.configure(api_key=GEMINI_API_KEY)
|
71 |
+
gemini_model_instance_gradio = genai.GenerativeModel(GEMINI_MODEL_NAME_GRADIO)
|
72 |
+
print(f"Gradio: Gemini API configured with model: {GEMINI_MODEL_NAME_GRADIO}")
|
73 |
+
except Exception as e:
|
74 |
+
print(f"Gradio: Failed to configure Gemini API: {e}")
|
75 |
+
gemini_model_instance_gradio = None
|
76 |
+
|
77 |
+
print("Gradio: All resources loaded (or attempted).")
|
78 |
+
|
79 |
+
|
80 |
+
# --- Helper Functions ---
|
81 |
+
def transcribe_audio_gradio(audio_input_tuple):
|
82 |
+
if asr_model_gradio is None:
|
83 |
+
return f"Error: ASR model ({ASR_MODEL_NAME}) not loaded."
|
84 |
+
|
85 |
+
if audio_input_tuple is None:
|
86 |
+
print("Gradio: No audio provided to transcribe_audio_gradio.")
|
87 |
+
return "No audio provided."
|
88 |
+
|
89 |
+
sample_rate, audio_numpy = audio_input_tuple
|
90 |
+
|
91 |
+
if audio_numpy is None or audio_numpy.size == 0:
|
92 |
+
print("Gradio: Audio numpy array is empty.")
|
93 |
+
return "Empty audio received."
|
94 |
+
|
95 |
+
# Ensure audio is mono float32, which is a common expectation
|
96 |
+
if audio_numpy.ndim > 1:
|
97 |
+
if audio_numpy.shape[0] == 2 and audio_numpy.ndim == 2:
|
98 |
+
audio_numpy = librosa.to_mono(audio_numpy)
|
99 |
+
elif audio_numpy.shape[1] == 2 and audio_numpy.ndim == 2:
|
100 |
+
audio_numpy = np.mean(audio_numpy, axis=1)
|
101 |
+
|
102 |
+
if audio_numpy.dtype != np.float32:
|
103 |
+
if np.issubdtype(audio_numpy.dtype, np.integer):
|
104 |
+
audio_numpy = audio_numpy.astype(np.float32) / np.iinfo(audio_numpy.dtype).max
|
105 |
+
else:
|
106 |
+
audio_numpy = audio_numpy.astype(np.float32)
|
107 |
+
|
108 |
+
# Resample to TARGET_SAMPLE_RATE (16kHz)
|
109 |
+
if sample_rate != TARGET_SAMPLE_RATE:
|
110 |
+
print(f"Gradio: Resampling audio from {sample_rate} Hz to {TARGET_SAMPLE_RATE} Hz.")
|
111 |
+
try:
|
112 |
+
audio_numpy = librosa.resample(y=audio_numpy, orig_sr=sample_rate, target_sr=TARGET_SAMPLE_RATE)
|
113 |
+
# After resampling, the audio_numpy is at TARGET_SAMPLE_RATE
|
114 |
+
except Exception as e:
|
115 |
+
print(f"Gradio: Error during resampling: {e}")
|
116 |
+
return f"Error during audio resampling: {str(e)}"
|
117 |
+
|
118 |
+
try:
|
119 |
+
print(f"Gradio: Preparing to transcribe with {ASR_MODEL_NAME}. Input audio shape: {audio_numpy.shape}")
|
120 |
+
|
121 |
+
# The model card example `model(wav, "hi", "ctc")` implies it might take a waveform tensor.
|
122 |
+
# We have a numpy array. We need to convert it to a PyTorch tensor.
|
123 |
+
# The model card uses torchaudio.load which returns a tensor.
|
124 |
+
# Let's convert our numpy array to a tensor and ensure it's on the correct device.
|
125 |
+
|
126 |
+
# Ensure the audio_numpy is 1D as expected by many ASR models for a single channel
|
127 |
+
if audio_numpy.ndim > 1:
|
128 |
+
audio_numpy = audio_numpy.squeeze() # Attempt to remove singleton dimensions
|
129 |
+
if audio_numpy.ndim > 1 : # If still more than 1D, problem
|
130 |
+
print(f"Gradio: Audio numpy array for ASR has unexpected dimensions after processing: {audio_numpy.shape}")
|
131 |
+
return "Error: Audio processing resulted in unexpected dimensions."
|
132 |
+
|
133 |
+
wav_tensor = torch.from_numpy(audio_numpy).to(DEVICE)
|
134 |
+
# The model might expect a batch dimension, e.g., [1, num_samples]
|
135 |
+
if wav_tensor.ndim == 1:
|
136 |
+
wav_tensor = wav_tensor.unsqueeze(0)
|
137 |
+
|
138 |
+
print(f"Gradio: Transcribing with {ASR_MODEL_NAME} using CTC. Input tensor shape: {wav_tensor.shape}")
|
139 |
+
|
140 |
+
# Perform ASR with CTC decoding (you can choose "rnnt" if preferred and supported)
|
141 |
+
# The language code "hi" is for Hindi. You might want to make this configurable
|
142 |
+
# or see if the model supports language auto-detection if you pass None or omit it.
|
143 |
+
# For now, assuming "hi" or that the model handles mixed language if lang_id is not strictly enforced.
|
144 |
+
# The model card doesn't specify if language ID is optional or how auto-detection works.
|
145 |
+
# Let's try "auto" or a common language like "en" or "hi" to start.
|
146 |
+
# The model card indicates training on 22 languages, so it's multilingual.
|
147 |
+
# If language_id is required, you'll need to provide it.
|
148 |
+
# Let's assume for now we try with a common Indian language or let the model try to auto-detect if "auto" or None is valid.
|
149 |
+
# The snippet "model(wav, "hi", "ctc")" is specific.
|
150 |
+
|
151 |
+
# The `model()` call is synchronous. Gradio handles this in a thread.
|
152 |
+
with torch.no_grad(): # Good practice for inference
|
153 |
+
transcription_result = asr_model_gradio(wav_tensor, "hi", "ctc") # Using lang_id="hi" and strategy="ctc" as per example
|
154 |
+
|
155 |
+
# The output format needs to be checked. The model card implies it's the transcribed string directly.
|
156 |
+
# It might be a list of transcriptions if batching occurs, or a dict.
|
157 |
+
if isinstance(transcription_result, list) and len(transcription_result) > 0:
|
158 |
+
transcribed_text = transcription_result[0] # Assuming first result for non-batched input
|
159 |
+
elif isinstance(transcription_result, str):
|
160 |
+
transcribed_text = transcription_result
|
161 |
+
else:
|
162 |
+
print(f"Gradio: Unexpected ASR result format: {type(transcription_result)}, value: {transcription_result}")
|
163 |
+
transcribed_text = "ASR result format not recognized."
|
164 |
+
|
165 |
+
transcribed_text = transcribed_text.strip()
|
166 |
+
print(f"Gradio: Transcription ({ASR_MODEL_NAME}, CTC): {transcribed_text}")
|
167 |
+
return transcribed_text if transcribed_text else "Transcription resulted in empty text."
|
168 |
+
except Exception as e:
|
169 |
+
print(f"Gradio: Error during {ASR_MODEL_NAME} transcription (AutoModel callable): {e}")
|
170 |
+
import traceback
|
171 |
+
traceback.print_exc()
|
172 |
+
return f"Error during transcription ({ASR_MODEL_NAME}): {str(e)}"
|
173 |
+
|
174 |
+
|
175 |
+
# ... (Gemini LLM and TTS functions remain the same) ...
|
176 |
+
def generate_gemini_response_gradio(text_input: str):
|
177 |
+
if not gemini_model_instance_gradio:
|
178 |
+
return "Error: Gemini LLM not configured or API key missing."
|
179 |
+
if not isinstance(text_input, str) or not text_input.strip() or text_input.startswith("Error:") or "No audio provided" in text_input or "Transcription resulted in empty text" in text_input or "Empty audio received" in text_input or "ASR result format not recognized" in text_input:
|
180 |
+
print(f"Gradio: Invalid input to Gemini: '{text_input}'. Skipping LLM response.")
|
181 |
+
return "LLM (Gemini) skipped due to transcription issue or no input."
|
182 |
+
try:
|
183 |
+
print(f"Gradio: Sending to Gemini: '{text_input}'")
|
184 |
+
full_prompt = f"User: {text_input}\nAssistant:"
|
185 |
+
response = gemini_model_instance_gradio.generate_content(full_prompt)
|
186 |
+
response_text = ""
|
187 |
+
if response.candidates and response.candidates[0].content.parts:
|
188 |
+
response_text = response.candidates[0].content.parts[0].text.strip()
|
189 |
+
else:
|
190 |
+
feedback_info = ""
|
191 |
+
if hasattr(response, 'prompt_feedback') and response.prompt_feedback:
|
192 |
+
feedback_info = f" Feedback: {response.prompt_feedback}"
|
193 |
+
print(f"Gradio: Gemini response did not contain expected content.{feedback_info}")
|
194 |
+
response_text = f"I'm sorry, I couldn't generate a response for that (Gemini).{feedback_info}"
|
195 |
+
|
196 |
+
print(f"Gradio: Gemini LLM Response: {response_text}")
|
197 |
+
return response_text if response_text else "Gemini LLM generated an empty response."
|
198 |
+
except Exception as e:
|
199 |
+
print(f"Gradio: Error during Gemini LLM generation: {e}")
|
200 |
+
import traceback
|
201 |
+
traceback.print_exc()
|
202 |
+
return f"Error during Gemini LLM generation: {str(e)}"
|
203 |
+
|
204 |
+
def synthesize_speech_gradio(text_input: str, description: str = "A clear, female voice speaking in English."):
|
205 |
+
if tts_model_gradio is None or tts_tokenizer_gradio is None:
|
206 |
+
return "Error: TTS model or its tokenizer not loaded."
|
207 |
+
if not isinstance(text_input, str) or not text_input.strip() or text_input.startswith("Error:") or "LLM skipped" in text_input or "generated an empty response" in text_input or "not configured" in text_input or "ASR result format not recognized" in text_input :
|
208 |
+
print(f"Gradio: Invalid input to TTS: '{text_input}'. Skipping synthesis.")
|
209 |
+
return "TTS skipped due to LLM issue or no input."
|
210 |
+
try:
|
211 |
+
print(f"Gradio: Synthesizing speech for: '{text_input}'")
|
212 |
+
description_tokenized = tts_tokenizer_gradio(description, return_tensors="pt", padding=True, truncation=True, max_length=128)
|
213 |
+
description_ids = description_tokenized.input_ids.to(DEVICE)
|
214 |
+
description_attention_mask = description_tokenized.attention_mask.to(DEVICE)
|
215 |
+
|
216 |
+
prompt_tokenized = tts_tokenizer_gradio(text_input, return_tensors="pt", padding=True, truncation=True, max_length=512)
|
217 |
+
prompt_ids = prompt_tokenized.input_ids.to(DEVICE)
|
218 |
+
|
219 |
+
if prompt_ids.shape[-1] == 0: # Check if tokenized prompt is empty
|
220 |
+
print(f"Gradio: Tokenized prompt for TTS is empty. Text was: '{text_input}'. Skipping synthesis.")
|
221 |
+
return "TTS skipped: Input text resulted in empty tokens."
|
222 |
+
|
223 |
+
|
224 |
+
generation = tts_model_gradio.generate(
|
225 |
+
input_ids=description_ids,
|
226 |
+
attention_mask=description_attention_mask,
|
227 |
+
prompt_input_ids=prompt_ids,
|
228 |
+
do_sample=True, temperature=0.7, top_k=50, top_p=0.95
|
229 |
+
).cpu().numpy().squeeze()
|
230 |
+
|
231 |
+
sampling_rate = tts_model_gradio.config.sampling_rate
|
232 |
+
print(f"Gradio: Speech synthesized. Array shape: {generation.shape}, Sample rate: {sampling_rate}")
|
233 |
+
return (sampling_rate, generation)
|
234 |
+
except Exception as e:
|
235 |
+
print(f"Gradio: Error during speech synthesis: {e}")
|
236 |
+
import traceback
|
237 |
+
traceback.print_exc()
|
238 |
+
if "You need to specify either `text` or `text_target`" in str(e):
|
239 |
+
return "Error in TTS: Model requires 'text' or 'text_target'. Input might be too short or problematic."
|
240 |
+
return f"Error during speech synthesis: {str(e)}"
|
241 |
+
|
242 |
+
# --- Gradio Interface Definition ---
|
243 |
+
load_all_resources_gradio()
|
244 |
+
|
245 |
+
def full_pipeline_gradio(audio_input):
|
246 |
+
transcribed_text_output = transcribe_audio_gradio(audio_input)
|
247 |
+
print(f"DEBUG full_pipeline_gradio - Step 1 (Transcription): '{transcribed_text_output}' (type: {type(transcribed_text_output)})")
|
248 |
+
llm_response_text_output = generate_gemini_response_gradio(transcribed_text_output)
|
249 |
+
print(f"DEBUG full_pipeline_gradio - Step 2 (LLM Response): '{llm_response_text_output}' (type: {type(llm_response_text_output)})")
|
250 |
+
tts_synthesis_result = synthesize_speech_gradio(llm_response_text_output)
|
251 |
+
final_audio_output = None
|
252 |
+
if isinstance(tts_synthesis_result, tuple) and len(tts_synthesis_result) == 2 and isinstance(tts_synthesis_result[1], np.ndarray):
|
253 |
+
final_audio_output = tts_synthesis_result
|
254 |
+
print(f"DEBUG full_pipeline_gradio - Step 3 (TTS Success): Audio tuple with shape {tts_synthesis_result[1].shape if isinstance(tts_synthesis_result[1], np.ndarray) else 'N/A'}")
|
255 |
+
else:
|
256 |
+
error_message_from_tts = str(tts_synthesis_result) if isinstance(tts_synthesis_result, str) else "TTS synthesis failed or returned unexpected type"
|
257 |
+
print(f"DEBUG full_pipeline_gradio - Step 3 (TTS Failed/Non-audio): {error_message_from_tts}. Providing silent audio.")
|
258 |
+
# Append TTS error to LLM text only if LLM text was valid
|
259 |
+
if llm_response_text_output and not llm_response_text_output.startswith("Error:") and "LLM skipped" not in llm_response_text_output and "ASR result format not recognized" not in llm_response_text_output:
|
260 |
+
llm_response_text_output = f"{llm_response_text_output} | (TTS Problem: {error_message_from_tts})"
|
261 |
+
elif not llm_response_text_output or llm_response_text_output.startswith("Error:") or "LLM skipped" in llm_response_text_output or "ASR result format not recognized" in llm_response_text_output:
|
262 |
+
# If LLM already had an error, just keep that error, maybe note TTS also had an issue
|
263 |
+
llm_response_text_output = f"{llm_response_text_output} (TTS also had an issue: {error_message_from_tts})"
|
264 |
+
|
265 |
+
default_sample_rate = tts_model_gradio.config.sampling_rate if tts_model_gradio and hasattr(tts_model_gradio, 'config') else TARGET_SAMPLE_RATE
|
266 |
+
final_audio_output = (default_sample_rate, np.array([0.0], dtype=np.float32))
|
267 |
+
print(f"DEBUG full_pipeline_gradio - Step 3 (TTS Fallback): Silent audio tuple")
|
268 |
+
print(f"DEBUG full_pipeline_gradio - RETURNING: Transcription='{transcribed_text_output}', LLM_Text='{llm_response_text_output}', Audio_Type={type(final_audio_output)}")
|
269 |
+
return transcribed_text_output, llm_response_text_output, final_audio_output
|
270 |
+
|
271 |
+
with gr.Blocks(title="Conversational AI Demo") as demo:
|
272 |
+
gr.Markdown("# Conversational AI Demo (STT -> Gemini LLM -> TTS)")
|
273 |
+
with gr.Row():
|
274 |
+
audio_in = gr.Audio(sources=["microphone"], type="numpy", label="Speak Here")
|
275 |
+
process_button = gr.Button("Process Audio")
|
276 |
+
with gr.Accordion("Outputs", open=True):
|
277 |
+
transcription_out = gr.Textbox(label="You Said (Transcription)", lines=2)
|
278 |
+
llm_response_out = gr.Textbox(label="Gemini Assistant Says (Text)", lines=5)
|
279 |
+
audio_out = gr.Audio(label="Assistant Says (Audio)")
|
280 |
+
|
281 |
+
process_button.click(
|
282 |
+
fn=full_pipeline_gradio,
|
283 |
+
inputs=[audio_in],
|
284 |
+
outputs=[transcription_out, llm_response_out, audio_out]
|
285 |
+
)
|
286 |
+
gr.Markdown("---")
|
287 |
+
gr.Markdown("### How to Use:")
|
288 |
+
gr.Markdown("1. Ensure your `GEMINI_API_KEY` environment variable is set.")
|
289 |
+
gr.Markdown("2. Click into the 'Speak Here' box and record your audio.")
|
290 |
+
gr.Markdown("3. Click the 'Process Audio' button.")
|
291 |
+
gr.Markdown("4. View the transcription, Gemini's text response, and listen to the audio response.")
|
292 |
+
|
293 |
+
if __name__ == "__main__":
|
294 |
+
demo.launch(share=False)
|
infereless.py
ADDED
@@ -0,0 +1,14 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import requests
|
2 |
+
import json
|
3 |
+
URL = 'https://serverless-region-v1.inferless.com/api/v1/parler-tts-streaming-1_ae4e81bb5d604799b573df3f0b3c9518/infer'
|
4 |
+
headers = {"Content-Type": "application/json", "Authorization": "Bearer 1e01145781a0639d830555d5e4e4e5e1752726750db75e995e0e246f32c4b7c9f442bd6f8caec8acc6b9684ec78e5b633db04370815ca1748bf5a7db80245411"}
|
5 |
+
|
6 |
+
data = json.loads('''{
|
7 |
+
"parameters": {
|
8 |
+
"prompt_value": "A male speaker with a low-pitched voice delivering his words at a fast pace in a small, confined space with a very clear audio and an animated tone.",
|
9 |
+
"input_value": "Remember - this is only the first iteration of the model! To improve the prosody and naturalness of the speech further, we're scaling up the amount of training data by a factor of five times."
|
10 |
+
}
|
11 |
+
}''')
|
12 |
+
|
13 |
+
response = requests.post(URL, headers=headers, data=json.dumps(data))
|
14 |
+
print(response.json())
|
main.py
ADDED
@@ -0,0 +1,740 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# main.py
|
2 |
+
import asyncio
|
3 |
+
import base64
|
4 |
+
import io
|
5 |
+
import logging
|
6 |
+
import os
|
7 |
+
from threading import Thread, Event # Added Event for better thread control
|
8 |
+
import time # For timeout checks
|
9 |
+
|
10 |
+
import soundfile as sf
|
11 |
+
import torch
|
12 |
+
import uvicorn
|
13 |
+
import whisper
|
14 |
+
from fastapi import FastAPI, File, UploadFile, WebSocket, WebSocketDisconnect
|
15 |
+
from fastapi.responses import HTMLResponse, JSONResponse
|
16 |
+
from fastapi.middleware.cors import CORSMiddleware
|
17 |
+
from parler_tts import ParlerTTSForConditionalGeneration, ParlerTTSStreamer
|
18 |
+
from transformers import AutoTokenizer, GenerationConfig # Keep transformers.GenerationConfig
|
19 |
+
import google.generativeai as genai
|
20 |
+
import numpy as np
|
21 |
+
|
22 |
+
# --- Configuration ---
|
23 |
+
WHISPER_MODEL_SIZE = os.getenv("WHISPER_MODEL_SIZE", "tiny")
|
24 |
+
TTS_MODEL_NAME = "ai4bharat/indic-parler-tts"
|
25 |
+
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY", "AIzaSyD6x3Yoby4eQ6QL2kaaG_Rz3fG3rh7wPB8")
|
26 |
+
GEMINI_MODEL_NAME = "gemini-1.5-flash-latest"
|
27 |
+
|
28 |
+
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
|
29 |
+
|
30 |
+
attn_implementation = "flash_attention_2" if torch.cuda.is_available() else "eager"
|
31 |
+
|
32 |
+
torch_dtype_tts = torch.bfloat16 if DEVICE == "cuda" and torch.cuda.is_bf16_supported() else (torch.float16 if DEVICE == "cuda" else torch.float32)
|
33 |
+
torch_dtype_whisper = torch.float16 if DEVICE == "cuda" else torch.float32
|
34 |
+
|
35 |
+
TTS_DEFAULT_PARAMS = {
|
36 |
+
"do_sample": True,
|
37 |
+
"temperature": 1.0,
|
38 |
+
"top_k": 50,
|
39 |
+
"top_p": 0.95,
|
40 |
+
"min_new_tokens": 5, # Reduced for quicker start with streamer
|
41 |
+
# "max_new_tokens": 256, # Optional global cap
|
42 |
+
}
|
43 |
+
|
44 |
+
# --- Logging ---
|
45 |
+
logging.basicConfig(level=logging.INFO)
|
46 |
+
logger = logging.getLogger(__name__)
|
47 |
+
|
48 |
+
# --- FastAPI App Initialization ---
|
49 |
+
app = FastAPI(title="Conversational AI Chatbot with Enhanced Stream Abortion")
|
50 |
+
|
51 |
+
app.add_middleware(
|
52 |
+
CORSMiddleware,
|
53 |
+
allow_origins=["*"],
|
54 |
+
allow_credentials=True,
|
55 |
+
allow_methods=["*"],
|
56 |
+
allow_headers=["*"],
|
57 |
+
)
|
58 |
+
|
59 |
+
# --- Global Model Variables ---
|
60 |
+
whisper_model = None
|
61 |
+
gemini_model_instance = None
|
62 |
+
tts_model = None
|
63 |
+
tts_tokenizer = None
|
64 |
+
# We will build the GenerationConfig object from TTS_DEFAULT_PARAMS inside the functions
|
65 |
+
# or store it globally if preferred, initialized from transformers.GenerationConfig
|
66 |
+
|
67 |
+
# --- Model Loading & API Configuration ---
|
68 |
+
@app.on_event("startup")
|
69 |
+
async def load_resources():
|
70 |
+
global whisper_model, tts_model, tts_tokenizer, gemini_model_instance
|
71 |
+
logger.info(f"Loading local models. Whisper on {DEVICE} with {torch_dtype_whisper}, TTS on {DEVICE} with {torch_dtype_tts}")
|
72 |
+
|
73 |
+
try:
|
74 |
+
logger.info(f"Loading Whisper model: {WHISPER_MODEL_SIZE}")
|
75 |
+
whisper_model = whisper.load_model(WHISPER_MODEL_SIZE, device=DEVICE)
|
76 |
+
logger.info("Whisper model loaded successfully.")
|
77 |
+
|
78 |
+
logger.info(f"Loading IndicParler-TTS model: {TTS_MODEL_NAME}")
|
79 |
+
tts_model = ParlerTTSForConditionalGeneration.from_pretrained(TTS_MODEL_NAME, attn_implementation=attn_implementation).to(DEVICE, dtype=torch_dtype_tts)
|
80 |
+
tts_tokenizer = AutoTokenizer.from_pretrained(TTS_MODEL_NAME)
|
81 |
+
|
82 |
+
if tts_tokenizer:
|
83 |
+
if tts_tokenizer.pad_token_id is not None:
|
84 |
+
TTS_DEFAULT_PARAMS["pad_token_id"] = tts_tokenizer.pad_token_id
|
85 |
+
# ParlerTTS uses a special token_id for silence, not eos_token_id for generation end.
|
86 |
+
# eos_token_id is more for text models.
|
87 |
+
# if tts_tokenizer.eos_token_id is not None:
|
88 |
+
# TTS_DEFAULT_PARAMS["eos_token_id"] = tts_tokenizer.eos_token_id
|
89 |
+
logger.info(f"IndicParler-TTS model loaded. Default generation params: {TTS_DEFAULT_PARAMS}")
|
90 |
+
|
91 |
+
if not GEMINI_API_KEY:
|
92 |
+
logger.warning("GEMINI_API_KEY not found. LLM functionality will be limited.")
|
93 |
+
else:
|
94 |
+
try:
|
95 |
+
genai.configure(api_key=GEMINI_API_KEY)
|
96 |
+
gemini_model_instance = genai.GenerativeModel(GEMINI_MODEL_NAME)
|
97 |
+
logger.info(f"Gemini API configured with model: {GEMINI_MODEL_NAME}")
|
98 |
+
except Exception as e:
|
99 |
+
logger.error(f"Failed to configure Gemini API: {e}", exc_info=True)
|
100 |
+
gemini_model_instance = None
|
101 |
+
|
102 |
+
except Exception as e:
|
103 |
+
logger.error(f"Error loading models: {e}", exc_info=True)
|
104 |
+
logger.info("Local models and API configurations loaded.")
|
105 |
+
|
106 |
+
|
107 |
+
# --- Helper Functions ---
|
108 |
+
async def transcribe_audio_bytes(audio_bytes: bytes) -> str:
|
109 |
+
if not whisper_model:
|
110 |
+
raise RuntimeError("Whisper model not loaded.")
|
111 |
+
temp_audio_path = f"temp_audio_main_{os.urandom(4).hex()}.wav"
|
112 |
+
try:
|
113 |
+
with open(temp_audio_path, "wb") as f:
|
114 |
+
f.write(audio_bytes)
|
115 |
+
result = whisper_model.transcribe(temp_audio_path, fp16=(DEVICE == "cuda" and torch_dtype_whisper == torch.float16))
|
116 |
+
transcribed_text = result["text"].strip()
|
117 |
+
logger.info(f"Transcription: {transcribed_text}")
|
118 |
+
return transcribed_text
|
119 |
+
except Exception as e:
|
120 |
+
logger.error(f"Error during transcription: {e}", exc_info=True)
|
121 |
+
return ""
|
122 |
+
finally:
|
123 |
+
if os.path.exists(temp_audio_path):
|
124 |
+
try:
|
125 |
+
os.remove(temp_audio_path)
|
126 |
+
except Exception as e_del:
|
127 |
+
logger.error(f"Error deleting temp audio file {temp_audio_path}: {e_del}")
|
128 |
+
|
129 |
+
|
130 |
+
async def generate_gemini_response(text: str) -> str:
|
131 |
+
if not gemini_model_instance:
|
132 |
+
logger.error("Gemini model instance not available.")
|
133 |
+
return "Sorry, the language model is currently unavailable."
|
134 |
+
try:
|
135 |
+
full_prompt = f"User: {text}\nAssistant:"
|
136 |
+
loop = asyncio.get_event_loop()
|
137 |
+
response = await loop.run_in_executor(None, gemini_model_instance.generate_content, full_prompt)
|
138 |
+
response_text = "I'm sorry, I couldn't generate a response for that."
|
139 |
+
if hasattr(response, 'text') and response.text: # For simple text responses
|
140 |
+
response_text = response.text.strip()
|
141 |
+
elif response.parts: # New way to access parts for gemini-1.5-flash and pro
|
142 |
+
response_text = "".join(part.text for part in response.parts).strip()
|
143 |
+
elif response.candidates and response.candidates[0].content.parts: # Older way
|
144 |
+
response_text = response.candidates[0].content.parts[0].text.strip()
|
145 |
+
else:
|
146 |
+
safety_feedback = ""
|
147 |
+
if hasattr(response, 'prompt_feedback') and response.prompt_feedback:
|
148 |
+
safety_feedback = f" Safety Feedback: {response.prompt_feedback}"
|
149 |
+
elif response.candidates and hasattr(response.candidates[0], 'finish_reason') and response.candidates[0].finish_reason != "STOP":
|
150 |
+
safety_feedback = f" Finish Reason: {response.candidates[0].finish_reason}"
|
151 |
+
logger.warning(f"Gemini response might be empty or blocked.{safety_feedback}")
|
152 |
+
logger.info(f"Gemini LLM Response: {response_text}")
|
153 |
+
return response_text
|
154 |
+
except Exception as e:
|
155 |
+
logger.error(f"Error during Gemini LLM generation: {e}", exc_info=True)
|
156 |
+
return "Sorry, I encountered an error trying to respond."
|
157 |
+
|
158 |
+
|
159 |
+
async def synthesize_speech_streaming(text: str, description: str = "A clear, female voice speaking in English.", play_steps_in_s: float = 0.4, cancellation_event: Event = Event()):
|
160 |
+
if not tts_model or not tts_tokenizer:
|
161 |
+
logger.error("TTS model or tokenizer not loaded.")
|
162 |
+
if cancellation_event and cancellation_event.is_set(): logger.info("TTS cancelled before start."); yield b""; return
|
163 |
+
yield b""
|
164 |
+
return
|
165 |
+
|
166 |
+
if not text or not text.strip():
|
167 |
+
logger.warning("TTS input text is empty. Yielding empty audio.")
|
168 |
+
if cancellation_event and cancellation_event.is_set(): logger.info("TTS cancelled before start (empty text)."); yield b""; return
|
169 |
+
yield b""
|
170 |
+
return
|
171 |
+
|
172 |
+
streamer = None
|
173 |
+
thread = None
|
174 |
+
|
175 |
+
try:
|
176 |
+
logger.info(f"Starting TTS streaming with ParlerTTSStreamer for: \"{text[:50]}...\"")
|
177 |
+
|
178 |
+
# Ensure sampling_rate is correctly accessed from the model's config
|
179 |
+
# For ParlerTTS, it's usually under model.config.audio_encoder.sampling_rate
|
180 |
+
if hasattr(tts_model.config, 'audio_encoder') and hasattr(tts_model.config.audio_encoder, 'sampling_rate'):
|
181 |
+
sampling_rate = tts_model.config.audio_encoder.sampling_rate
|
182 |
+
else:
|
183 |
+
logger.warning("Could not find tts_model.config.audio_encoder.sampling_rate, defaulting to 24000")
|
184 |
+
sampling_rate = 24000 # A common default for ParlerTTS if not found
|
185 |
+
|
186 |
+
try:
|
187 |
+
frame_rate = getattr(tts_model.config.audio_encoder, 'frame_rate', 100)
|
188 |
+
except AttributeError:
|
189 |
+
logger.warning("frame_rate not found in tts_model.config.audio_encoder. Using default of 100 Hz for play_steps calculation.")
|
190 |
+
frame_rate = 100
|
191 |
+
|
192 |
+
play_steps = int(frame_rate * play_steps_in_s)
|
193 |
+
if play_steps == 0 : play_steps = 1
|
194 |
+
|
195 |
+
logger.info(f"Streamer params: sampling_rate={sampling_rate}, frame_rate={frame_rate}, play_steps_in_s={play_steps_in_s}, play_steps={play_steps}")
|
196 |
+
|
197 |
+
streamer = ParlerTTSStreamer(tts_model, device=DEVICE, play_steps=play_steps)
|
198 |
+
|
199 |
+
description_inputs = tts_tokenizer(description, return_tensors="pt")
|
200 |
+
prompt_inputs = tts_tokenizer(text, return_tensors="pt")
|
201 |
+
|
202 |
+
gen_config_dict = TTS_DEFAULT_PARAMS.copy()
|
203 |
+
# ParlerTTS generate method might not take a GenerationConfig object directly,
|
204 |
+
# but rather individual kwargs. The streamer example passes them as kwargs.
|
205 |
+
# We ensure pad_token_id and eos_token_id are set if the tokenizer has them.
|
206 |
+
if tts_tokenizer.pad_token_id is not None:
|
207 |
+
gen_config_dict["pad_token_id"] = tts_tokenizer.pad_token_id
|
208 |
+
# ParlerTTS might not use eos_token_id in the same way as text models.
|
209 |
+
# if tts_tokenizer.eos_token_id is not None:
|
210 |
+
# gen_config_dict["eos_token_id"] = tts_tokenizer.eos_token_id
|
211 |
+
|
212 |
+
|
213 |
+
thread_generation_kwargs = {
|
214 |
+
"input_ids": description_inputs.input_ids.to(DEVICE),
|
215 |
+
"prompt_input_ids": prompt_inputs.input_ids.to(DEVICE),
|
216 |
+
"attention_mask": description_inputs.attention_mask.to(DEVICE) if hasattr(description_inputs, 'attention_mask') else None,
|
217 |
+
"streamer": streamer,
|
218 |
+
**gen_config_dict # Spread the generation parameters
|
219 |
+
}
|
220 |
+
if thread_generation_kwargs["attention_mask"] is None:
|
221 |
+
del thread_generation_kwargs["attention_mask"]
|
222 |
+
|
223 |
+
def _generate_in_thread():
|
224 |
+
try:
|
225 |
+
logger.info(f"TTS generation thread started.")
|
226 |
+
with torch.no_grad():
|
227 |
+
tts_model.generate(**thread_generation_kwargs)
|
228 |
+
logger.info("TTS generation thread finished model.generate().")
|
229 |
+
except Exception as e_thread:
|
230 |
+
logger.error(f"Error in TTS generation thread: {e_thread}", exc_info=True)
|
231 |
+
finally:
|
232 |
+
if streamer: streamer.end()
|
233 |
+
logger.info("TTS generation thread called streamer.end().")
|
234 |
+
|
235 |
+
thread = Thread(target=_generate_in_thread)
|
236 |
+
thread.daemon = True
|
237 |
+
thread.start()
|
238 |
+
|
239 |
+
loop = asyncio.get_event_loop()
|
240 |
+
while True:
|
241 |
+
if cancellation_event and cancellation_event.is_set():
|
242 |
+
logger.info("TTS streaming cancelled by event.")
|
243 |
+
break
|
244 |
+
|
245 |
+
try:
|
246 |
+
# Run the blocking streamer.__next__() in an executor
|
247 |
+
audio_chunk_tensor = await loop.run_in_executor(None, streamer.__next__)
|
248 |
+
|
249 |
+
if audio_chunk_tensor is None:
|
250 |
+
logger.info("Streamer yielded None explicitly, ending stream.")
|
251 |
+
break
|
252 |
+
# This check for numel == 0 is important as streamer might yield empty tensors
|
253 |
+
if not isinstance(audio_chunk_tensor, torch.Tensor) or audio_chunk_tensor.numel() == 0:
|
254 |
+
# REMOVED: if streamer.is_done(): (AttributeError)
|
255 |
+
# Instead, rely on StopIteration or explicit None from streamer
|
256 |
+
await asyncio.sleep(0.01) # Small sleep if empty but not done
|
257 |
+
continue
|
258 |
+
|
259 |
+
audio_chunk_np = audio_chunk_tensor.cpu().to(torch.float32).numpy().squeeze()
|
260 |
+
if audio_chunk_np.size == 0:
|
261 |
+
continue
|
262 |
+
|
263 |
+
audio_chunk_int16 = np.clip(audio_chunk_np * 32767, -32768, 32767).astype(np.int16)
|
264 |
+
yield audio_chunk_int16.tobytes()
|
265 |
+
# No need for sleep here if chunks are substantial, client will process
|
266 |
+
# await asyncio.sleep(0.001) # Can be removed or made very small
|
267 |
+
|
268 |
+
except StopIteration:
|
269 |
+
logger.info("Streamer finished (StopIteration).")
|
270 |
+
break
|
271 |
+
except Exception as e_stream_iter:
|
272 |
+
logger.error(f"Error iterating streamer: {e_stream_iter}", exc_info=True)
|
273 |
+
break
|
274 |
+
|
275 |
+
logger.info(f"Finished TTS streaming iteration for: \"{text[:50]}...\"")
|
276 |
+
|
277 |
+
except Exception as e:
|
278 |
+
logger.error(f"Error in synthesize_speech_streaming function: {e}", exc_info=True)
|
279 |
+
yield b""
|
280 |
+
finally:
|
281 |
+
logger.info("Exiting synthesize_speech_streaming. Ensuring streamer is ended and thread is joined.")
|
282 |
+
if streamer:
|
283 |
+
streamer.end()
|
284 |
+
if thread and thread.is_alive():
|
285 |
+
logger.info("Waiting for TTS generation thread to complete in finally block...")
|
286 |
+
final_join_start_time = time.time()
|
287 |
+
thread.join(timeout=2.0)
|
288 |
+
if thread.is_alive():
|
289 |
+
logger.warning(f"TTS generation thread still alive after {time.time() - final_join_start_time:.2f}s in finally block.")
|
290 |
+
|
291 |
+
|
292 |
+
# --- FastAPI HTTP Endpoints ---
|
293 |
+
@app.post("/api/stt", summary="Speech to Text")
|
294 |
+
async def speech_to_text_endpoint(file: UploadFile = File(...)):
|
295 |
+
if not whisper_model:
|
296 |
+
return JSONResponse(content={"error": "Whisper model not loaded"}, status_code=503)
|
297 |
+
try:
|
298 |
+
audio_bytes = await file.read()
|
299 |
+
transcribed_text = await transcribe_audio_bytes(audio_bytes)
|
300 |
+
return {"transcription": transcribed_text}
|
301 |
+
except Exception as e:
|
302 |
+
return JSONResponse(content={"error": str(e)}, status_code=500)
|
303 |
+
|
304 |
+
@app.post("/api/llm", summary="LLM Response Generation (Gemini)")
|
305 |
+
async def llm_endpoint(payload: dict):
|
306 |
+
if not gemini_model_instance:
|
307 |
+
return JSONResponse(content={"error": "Gemini LLM not configured or API key missing"}, status_code=503)
|
308 |
+
try:
|
309 |
+
text = payload.get("text")
|
310 |
+
if not text:
|
311 |
+
return JSONResponse(content={"error": "No text provided"}, status_code=400)
|
312 |
+
response = await generate_gemini_response(text)
|
313 |
+
return {"response": response}
|
314 |
+
except Exception as e:
|
315 |
+
return JSONResponse(content={"error": str(e)}, status_code=500)
|
316 |
+
|
317 |
+
@app.post("/api/tts", summary="Text to Speech (Non-Streaming for HTTP)")
|
318 |
+
async def text_to_speech_endpoint(payload: dict):
|
319 |
+
if not tts_model or not tts_tokenizer:
|
320 |
+
return JSONResponse(content={"error": "TTS model/tokenizer not loaded"}, status_code=503)
|
321 |
+
try:
|
322 |
+
text = payload.get("text")
|
323 |
+
description = payload.get("description", "A clear, female voice speaking in English.")
|
324 |
+
if not text:
|
325 |
+
return JSONResponse(content={"error": "No text provided"}, status_code=400)
|
326 |
+
|
327 |
+
description_inputs = tts_tokenizer(description, return_tensors="pt")
|
328 |
+
prompt_inputs = tts_tokenizer(text, return_tensors="pt")
|
329 |
+
|
330 |
+
# Use a GenerationConfig object for clarity and consistency
|
331 |
+
gen_config_dict = TTS_DEFAULT_PARAMS.copy()
|
332 |
+
if tts_tokenizer.pad_token_id is not None:
|
333 |
+
gen_config_dict["pad_token_id"] = tts_tokenizer.pad_token_id
|
334 |
+
# if tts_tokenizer.eos_token_id is not None: # ParlerTTS might not use standard eos
|
335 |
+
# gen_config_dict["eos_token_id"] = tts_tokenizer.eos_token_id
|
336 |
+
|
337 |
+
# Create GenerationConfig from transformers
|
338 |
+
generation_config_obj = GenerationConfig(**gen_config_dict)
|
339 |
+
|
340 |
+
|
341 |
+
with torch.no_grad():
|
342 |
+
generation = tts_model.generate(
|
343 |
+
input_ids=description_inputs.input_ids.to(DEVICE),
|
344 |
+
prompt_input_ids=prompt_inputs.input_ids.to(DEVICE),
|
345 |
+
attention_mask=description_inputs.attention_mask.to(DEVICE) if hasattr(description_inputs, 'attention_mask') else None,
|
346 |
+
generation_config=generation_config_obj # Pass the config object
|
347 |
+
).cpu().to(torch.float32).numpy().squeeze()
|
348 |
+
|
349 |
+
audio_io = io.BytesIO()
|
350 |
+
scaled_generation = np.clip(generation * 32767, -32768, 32767).astype(np.int16)
|
351 |
+
|
352 |
+
current_sampling_rate = tts_model.config.audio_encoder.sampling_rate if hasattr(tts_model.config, 'audio_encoder') else 24000
|
353 |
+
sf.write(audio_io, scaled_generation, samplerate=current_sampling_rate, format='WAV', subtype='PCM_16')
|
354 |
+
audio_io.seek(0)
|
355 |
+
audio_bytes = audio_io.read()
|
356 |
+
|
357 |
+
if not audio_bytes:
|
358 |
+
return JSONResponse(content={"error": "TTS failed to generate audio"}, status_code=500)
|
359 |
+
audio_base64 = base64.b64encode(audio_bytes).decode('utf-8')
|
360 |
+
return {"audio_base64": audio_base64, "format": "wav", "sample_rate": current_sampling_rate}
|
361 |
+
except Exception as e:
|
362 |
+
logger.error(f"TTS endpoint error: {e}", exc_info=True)
|
363 |
+
return JSONResponse(content={"error": str(e)}, status_code=500)
|
364 |
+
|
365 |
+
# --- WebSocket Endpoint for Real-time Conversation ---
|
366 |
+
@app.websocket("/ws/conversation")
|
367 |
+
async def conversation_websocket(websocket: WebSocket):
|
368 |
+
await websocket.accept()
|
369 |
+
logger.info(f"WebSocket connection accepted from: {websocket.client}")
|
370 |
+
|
371 |
+
tts_cancellation_event = Event() # For this specific connection
|
372 |
+
|
373 |
+
try:
|
374 |
+
while True:
|
375 |
+
if websocket.client_state.name != 'CONNECTED': # Check if client disconnected before receive
|
376 |
+
logger.info(f"WebSocket client {websocket.client} disconnected before receive.")
|
377 |
+
break
|
378 |
+
|
379 |
+
audio_data = await websocket.receive_bytes()
|
380 |
+
logger.info(f"Received {len(audio_data)} bytes of user audio data from {websocket.client}.")
|
381 |
+
|
382 |
+
if not audio_data:
|
383 |
+
logger.warning(f"Received empty audio data from user {websocket.client}.")
|
384 |
+
continue
|
385 |
+
|
386 |
+
transcribed_text = await transcribe_audio_bytes(audio_data)
|
387 |
+
if not transcribed_text:
|
388 |
+
logger.warning(f"Transcription failed for {websocket.client}.")
|
389 |
+
await websocket.send_text("SYSTEM_ERROR: Transcription failed.")
|
390 |
+
continue
|
391 |
+
await websocket.send_text(f"USER_TRANSCRIPT: {transcribed_text}")
|
392 |
+
|
393 |
+
llm_response_text = await generate_gemini_response(transcribed_text)
|
394 |
+
if not llm_response_text or "Sorry, I encountered an error" in llm_response_text or "unavailable" in llm_response_text:
|
395 |
+
logger.warning(f"LLM (Gemini) failed for {websocket.client}: {llm_response_text}")
|
396 |
+
await websocket.send_text(f"SYSTEM_ERROR: LLM failed. ({llm_response_text})")
|
397 |
+
continue
|
398 |
+
await websocket.send_text(f"ASSISTANT_RESPONSE_TEXT: {llm_response_text}")
|
399 |
+
|
400 |
+
tts_description = "A clear, female voice speaking in English."
|
401 |
+
|
402 |
+
current_sampling_rate = tts_model.config.audio_encoder.sampling_rate if hasattr(tts_model.config, 'audio_encoder') else 24000
|
403 |
+
audio_params_msg = f"TTS_STREAM_START:{{\"sample_rate\": {current_sampling_rate}, \"channels\": 1, \"bit_depth\": 16}}"
|
404 |
+
await websocket.send_text(audio_params_msg)
|
405 |
+
logger.info(f"Sent to client {websocket.client}: {audio_params_msg}")
|
406 |
+
|
407 |
+
chunk_count = 0
|
408 |
+
tts_cancellation_event.clear() # Reset event for new TTS task
|
409 |
+
|
410 |
+
async for audio_chunk_bytes in synthesize_speech_streaming(llm_response_text, tts_description, cancellation_event=tts_cancellation_event):
|
411 |
+
if not audio_chunk_bytes:
|
412 |
+
logger.debug(f"Received empty bytes from streaming generator for {websocket.client}, might be end or error in generator.")
|
413 |
+
continue
|
414 |
+
try:
|
415 |
+
if websocket.client_state.name != 'CONNECTED':
|
416 |
+
logger.warning(f"Client {websocket.client} disconnected during TTS stream. Aborting TTS.")
|
417 |
+
tts_cancellation_event.set() # Signal TTS thread to stop
|
418 |
+
break
|
419 |
+
await websocket.send_bytes(audio_chunk_bytes)
|
420 |
+
chunk_count += 1
|
421 |
+
except Exception as send_err:
|
422 |
+
logger.warning(f"Error sending audio chunk to {websocket.client}: {send_err}. Client likely disconnected.")
|
423 |
+
tts_cancellation_event.set() # Signal TTS thread to stop
|
424 |
+
break
|
425 |
+
|
426 |
+
if not tts_cancellation_event.is_set(): # Only send END if not cancelled
|
427 |
+
logger.info(f"Sent {chunk_count} TTS audio chunks to client {websocket.client}.")
|
428 |
+
await websocket.send_text("TTS_STREAM_END")
|
429 |
+
logger.info(f"Sent TTS_STREAM_END to client {websocket.client}.")
|
430 |
+
else:
|
431 |
+
logger.info(f"TTS stream for {websocket.client} was cancelled. Sent {chunk_count} chunks before cancellation.")
|
432 |
+
|
433 |
+
|
434 |
+
except WebSocketDisconnect:
|
435 |
+
logger.info(f"WebSocket connection closed by client {websocket.client}.")
|
436 |
+
tts_cancellation_event.set() # Signal any active TTS to stop
|
437 |
+
except Exception as e:
|
438 |
+
logger.error(f"Error in WebSocket conversation with {websocket.client}: {e}", exc_info=True)
|
439 |
+
tts_cancellation_event.set() # Signal any active TTS to stop
|
440 |
+
try:
|
441 |
+
if websocket.client_state.name == 'CONNECTED':
|
442 |
+
await websocket.send_text(f"SYSTEM_ERROR: An unexpected error occurred: {str(e)}")
|
443 |
+
except Exception: pass
|
444 |
+
finally:
|
445 |
+
logger.info(f"Cleaning up WebSocket connection for {websocket.client}.")
|
446 |
+
tts_cancellation_event.set() # Ensure event is set on any exit path
|
447 |
+
if websocket.client_state.name == 'CONNECTED' or websocket.client_state.name == 'CONNECTING':
|
448 |
+
try: await websocket.close()
|
449 |
+
except Exception: pass
|
450 |
+
logger.info(f"WebSocket connection resources cleaned up for {websocket.client}.")
|
451 |
+
|
452 |
+
# ... (HTML serving and main execution block remain the same) ...
|
453 |
+
@app.get("/", response_class=HTMLResponse)
|
454 |
+
async def get_home():
|
455 |
+
html_content = """
|
456 |
+
<!DOCTYPE html>
|
457 |
+
<html>
|
458 |
+
<head>
|
459 |
+
<title>Conversational AI Chatbot (Streaming)</title>
|
460 |
+
<style>
|
461 |
+
body { font-family: Arial, sans-serif; margin: 20px; background-color: #f4f4f4; color: #333; }
|
462 |
+
#chatbox { width: 80%; max-width: 600px; margin: auto; background-color: #fff; padding: 20px; box-shadow: 0 0 10px rgba(0,0,0,0.1); border-radius: 8px; }
|
463 |
+
.message { padding: 10px; margin-bottom: 10px; border-radius: 5px; }
|
464 |
+
.user { background-color: #e1f5fe; text-align: right; }
|
465 |
+
.assistant { background-color: #f1f8e9; }
|
466 |
+
.system { background-color: #ffebee; color: #c62828; font-style: italic;}
|
467 |
+
#audioPlayerContainer { margin-top: 10px; }
|
468 |
+
#audioPlayer { display: none; width: 100%; }
|
469 |
+
button { padding: 10px 15px; background-color: #007bff; color: white; border: none; border-radius: 5px; cursor: pointer; margin-top:10px; }
|
470 |
+
button:disabled { background-color: #ccc; }
|
471 |
+
#status { margin-top: 10px; font-style: italic; color: #666; }
|
472 |
+
#transcriptionArea, #llmResponseArea { margin-top: 10px; padding: 5px; border: 1px solid #eee; background: #fafafa; word-wrap: break-word;}
|
473 |
+
</style>
|
474 |
+
</head>
|
475 |
+
<body>
|
476 |
+
<div id="chatbox">
|
477 |
+
<h2>Real-time AI Chatbot (Streaming TTS)</h2>
|
478 |
+
<div id="messages"></div>
|
479 |
+
<div id="transcriptionArea"><strong>You (transcribed):</strong> <span id="userTranscript">...</span></div>
|
480 |
+
<div id="llmResponseArea"><strong>Assistant (text):</strong> <span id="assistantTranscript">...</span></div>
|
481 |
+
|
482 |
+
<button id="startRecordButton">Start Recording</button>
|
483 |
+
<button id="stopRecordButton" disabled>Stop Recording</button>
|
484 |
+
<p id="status">Status: Idle</p>
|
485 |
+
<div id="audioPlayerContainer">
|
486 |
+
<audio id="audioPlayer" controls></audio>
|
487 |
+
</div>
|
488 |
+
</div>
|
489 |
+
|
490 |
+
<script>
|
491 |
+
const startRecordButton = document.getElementById('startRecordButton');
|
492 |
+
const stopRecordButton = document.getElementById('stopRecordButton');
|
493 |
+
const audioPlayer = document.getElementById('audioPlayer');
|
494 |
+
const messagesDiv = document.getElementById('messages');
|
495 |
+
const statusDiv = document.getElementById('status');
|
496 |
+
const userTranscriptSpan = document.getElementById('userTranscript');
|
497 |
+
const assistantTranscriptSpan = document.getElementById('assistantTranscript');
|
498 |
+
|
499 |
+
let websocket;
|
500 |
+
let mediaRecorder;
|
501 |
+
let userAudioChunks = [];
|
502 |
+
|
503 |
+
let assistantAudioBufferQueue = [];
|
504 |
+
let audioContext;
|
505 |
+
let expectedSampleRate;
|
506 |
+
let ttsStreaming = false;
|
507 |
+
let audioPlaying = false;
|
508 |
+
let sourceNode = null;
|
509 |
+
|
510 |
+
function initAudioContext() {
|
511 |
+
if (!audioContext || audioContext.state === 'closed') {
|
512 |
+
try {
|
513 |
+
audioContext = new (window.AudioContext || window.webkitAudioContext)();
|
514 |
+
console.log("AudioContext initialized or re-initialized.");
|
515 |
+
} catch (e) {
|
516 |
+
console.error("Web Audio API is not supported in this browser.", e);
|
517 |
+
addMessage("Error: Web Audio API not supported. Cannot play streamed audio.", "system");
|
518 |
+
audioContext = null;
|
519 |
+
}
|
520 |
+
}
|
521 |
+
}
|
522 |
+
|
523 |
+
|
524 |
+
function connectWebSocket() {
|
525 |
+
const protocol = window.location.protocol === 'https:' ? 'wss:' : 'ws:';
|
526 |
+
const wsUrl = `${protocol}//${window.location.host}/ws/conversation`;
|
527 |
+
websocket = new WebSocket(wsUrl);
|
528 |
+
websocket.binaryType = 'arraybuffer';
|
529 |
+
|
530 |
+
websocket.onopen = () => {
|
531 |
+
statusDiv.textContent = 'Status: Connected. Ready to record.';
|
532 |
+
startRecordButton.disabled = false;
|
533 |
+
initAudioContext();
|
534 |
+
};
|
535 |
+
|
536 |
+
websocket.onmessage = (event) => {
|
537 |
+
if (event.data instanceof ArrayBuffer) {
|
538 |
+
if (ttsStreaming && audioContext && expectedSampleRate) {
|
539 |
+
const pcmDataInt16 = new Int16Array(event.data);
|
540 |
+
if (pcmDataInt16.length > 0) {
|
541 |
+
assistantAudioBufferQueue.push(pcmDataInt16);
|
542 |
+
playNextChunkFromQueue();
|
543 |
+
}
|
544 |
+
} else {
|
545 |
+
console.warn("Received ArrayBuffer data but not in TTS streaming mode or AudioContext not ready.");
|
546 |
+
}
|
547 |
+
} else {
|
548 |
+
const messageText = event.data;
|
549 |
+
if (messageText.startsWith("USER_TRANSCRIPT:")) {
|
550 |
+
const transcript = messageText.substring("USER_TRANSCRIPT:".length).trim();
|
551 |
+
userTranscriptSpan.textContent = transcript;
|
552 |
+
} else if (messageText.startsWith("ASSISTANT_RESPONSE_TEXT:")) {
|
553 |
+
const llmResponse = messageText.substring("ASSISTANT_RESPONSE_TEXT:".length).trim();
|
554 |
+
assistantTranscriptSpan.textContent = llmResponse;
|
555 |
+
addMessage(`Assistant: ${llmResponse}`, 'assistant');
|
556 |
+
} else if (messageText.startsWith("TTS_STREAM_START:")) {
|
557 |
+
ttsStreaming = true;
|
558 |
+
assistantAudioBufferQueue = [];
|
559 |
+
audioPlaying = false;
|
560 |
+
if (sourceNode) {
|
561 |
+
try { sourceNode.stop(); } catch(e) { console.warn("Error stopping previous sourceNode:", e); }
|
562 |
+
sourceNode = null;
|
563 |
+
}
|
564 |
+
audioPlayer.style.display = 'none';
|
565 |
+
audioPlayer.src = "";
|
566 |
+
try {
|
567 |
+
const paramsText = messageText.substring("TTS_STREAM_START:".length);
|
568 |
+
const params = JSON.parse(paramsText);
|
569 |
+
expectedSampleRate = params.sample_rate;
|
570 |
+
initAudioContext();
|
571 |
+
statusDiv.textContent = 'Status: Receiving audio stream...';
|
572 |
+
addMessage('Assistant (Audio stream starting...)', 'assistant');
|
573 |
+
} catch (e) {
|
574 |
+
console.error("Could not parse TTS_STREAM_START params:", e);
|
575 |
+
statusDiv.textContent = 'Error: Could not parse audio stream parameters.';
|
576 |
+
ttsStreaming = false;
|
577 |
+
}
|
578 |
+
} else if (messageText === "TTS_STREAM_END") {
|
579 |
+
ttsStreaming = false;
|
580 |
+
if (!audioPlaying && assistantAudioBufferQueue.length === 0) {
|
581 |
+
statusDiv.textContent = 'Status: Audio stream finished (or was empty).';
|
582 |
+
} else if (!audioPlaying && assistantAudioBufferQueue.length > 0) {
|
583 |
+
playNextChunkFromQueue();
|
584 |
+
statusDiv.textContent = 'Status: Audio stream finished. Playing remaining...';
|
585 |
+
} else {
|
586 |
+
statusDiv.textContent = 'Status: Audio stream finished. Playing remaining...';
|
587 |
+
}
|
588 |
+
addMessage('Assistant (Audio stream ended)', 'assistant');
|
589 |
+
} else if (messageText.startsWith("SYSTEM_ERROR:")) {
|
590 |
+
const errorMsg = messageText.substring("SYSTEM_ERROR:".length).trim();
|
591 |
+
addMessage(`System Error: ${errorMsg}`, 'system');
|
592 |
+
statusDiv.textContent = `Error: ${errorMsg}`;
|
593 |
+
ttsStreaming = false;
|
594 |
+
assistantAudioBufferQueue = [];
|
595 |
+
} else {
|
596 |
+
addMessage(messageText, 'system');
|
597 |
+
}
|
598 |
+
}
|
599 |
+
};
|
600 |
+
websocket.onerror = (error) => {
|
601 |
+
console.error('WebSocket Error:', error);
|
602 |
+
statusDiv.textContent = 'Status: WebSocket error. Try reconnecting.';
|
603 |
+
addMessage('WebSocket Error. Check console.', 'system');
|
604 |
+
ttsStreaming = false;
|
605 |
+
};
|
606 |
+
|
607 |
+
websocket.onclose = () => {
|
608 |
+
statusDiv.textContent = 'Status: Disconnected. Please refresh to reconnect.';
|
609 |
+
startRecordButton.disabled = true;
|
610 |
+
stopRecordButton.disabled = true;
|
611 |
+
addMessage('Disconnected from server.', 'system');
|
612 |
+
ttsStreaming = false;
|
613 |
+
if (audioContext && audioContext.state !== 'closed') {
|
614 |
+
audioContext.close().catch(e => console.warn("Error closing AudioContext:", e));
|
615 |
+
audioContext = null;
|
616 |
+
console.log("AudioContext closed.");
|
617 |
+
}
|
618 |
+
};
|
619 |
+
}
|
620 |
+
|
621 |
+
function playNextChunkFromQueue() {
|
622 |
+
if (audioPlaying || assistantAudioBufferQueue.length === 0 || !audioContext || audioContext.state !== 'running' || !expectedSampleRate) {
|
623 |
+
if (assistantAudioBufferQueue.length === 0 && !ttsStreaming && !audioPlaying) {
|
624 |
+
console.log("Queue empty, not streaming, not playing: Playback complete.");
|
625 |
+
statusDiv.textContent = 'Status: Audio playback complete.';
|
626 |
+
}
|
627 |
+
return;
|
628 |
+
}
|
629 |
+
audioPlaying = true;
|
630 |
+
|
631 |
+
const pcmDataInt16 = assistantAudioBufferQueue.shift();
|
632 |
+
|
633 |
+
const float32Pcm = new Float32Array(pcmDataInt16.length);
|
634 |
+
for (let i = 0; i < pcmDataInt16.length; i++) {
|
635 |
+
float32Pcm[i] = pcmDataInt16[i] / 32768.0;
|
636 |
+
}
|
637 |
+
|
638 |
+
const audioBuffer = audioContext.createBuffer(1, float32Pcm.length, expectedSampleRate);
|
639 |
+
audioBuffer.getChannelData(0).set(float32Pcm);
|
640 |
+
|
641 |
+
sourceNode = audioContext.createBufferSource();
|
642 |
+
sourceNode.buffer = audioBuffer;
|
643 |
+
sourceNode.connect(audioContext.destination);
|
644 |
+
sourceNode.onended = () => {
|
645 |
+
audioPlaying = false;
|
646 |
+
if (ttsStreaming || assistantAudioBufferQueue.length > 0) {
|
647 |
+
playNextChunkFromQueue();
|
648 |
+
} else {
|
649 |
+
statusDiv.textContent = 'Status: Audio playback finished.';
|
650 |
+
console.log("All queued audio chunks played.");
|
651 |
+
}
|
652 |
+
};
|
653 |
+
sourceNode.start();
|
654 |
+
statusDiv.textContent = 'Status: Playing audio chunk...';
|
655 |
+
}
|
656 |
+
|
657 |
+
function addMessage(text, type) {
|
658 |
+
const messageElement = document.createElement('div');
|
659 |
+
messageElement.classList.add('message', type);
|
660 |
+
messageElement.textContent = text;
|
661 |
+
messagesDiv.appendChild(messageElement);
|
662 |
+
messagesDiv.scrollTop = messagesDiv.scrollHeight;
|
663 |
+
}
|
664 |
+
|
665 |
+
startRecordButton.onclick = async () => {
|
666 |
+
if (!websocket || websocket.readyState !== WebSocket.OPEN) {
|
667 |
+
alert("WebSocket is not connected. Please wait or refresh.");
|
668 |
+
return;
|
669 |
+
}
|
670 |
+
if (audioContext && audioContext.state === 'suspended') {
|
671 |
+
audioContext.resume().catch(e => console.error("Error resuming AudioContext:", e));
|
672 |
+
}
|
673 |
+
initAudioContext();
|
674 |
+
|
675 |
+
try {
|
676 |
+
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
|
677 |
+
let options = { mimeType: 'audio/webm;codecs=opus' };
|
678 |
+
if (!MediaRecorder.isTypeSupported(options.mimeType)) {
|
679 |
+
console.warn(`${options.mimeType} is not supported, trying default.`);
|
680 |
+
options = {};
|
681 |
+
}
|
682 |
+
mediaRecorder = new MediaRecorder(stream, options);
|
683 |
+
userAudioChunks = [];
|
684 |
+
|
685 |
+
mediaRecorder.ondataavailable = event => {
|
686 |
+
if (event.data.size > 0) userAudioChunks.push(event.data);
|
687 |
+
};
|
688 |
+
|
689 |
+
mediaRecorder.onstop = () => {
|
690 |
+
if (userAudioChunks.length === 0) {
|
691 |
+
console.log("No audio data recorded.");
|
692 |
+
statusDiv.textContent = 'Status: No audio data recorded. Try again.';
|
693 |
+
startRecordButton.disabled = false;
|
694 |
+
stopRecordButton.disabled = true;
|
695 |
+
return;
|
696 |
+
}
|
697 |
+
const audioBlob = new Blob(userAudioChunks, { type: mediaRecorder.mimeType });
|
698 |
+
if (websocket && websocket.readyState === WebSocket.OPEN) {
|
699 |
+
websocket.send(audioBlob);
|
700 |
+
statusDiv.textContent = 'Status: Audio sent. Waiting for response...';
|
701 |
+
} else {
|
702 |
+
statusDiv.textContent = 'Status: WebSocket not open. Cannot send audio.';
|
703 |
+
}
|
704 |
+
userAudioChunks = [];
|
705 |
+
};
|
706 |
+
|
707 |
+
mediaRecorder.start(250);
|
708 |
+
startRecordButton.disabled = true;
|
709 |
+
stopRecordButton.disabled = false;
|
710 |
+
statusDiv.textContent = 'Status: Recording...';
|
711 |
+
userTranscriptSpan.textContent = "...";
|
712 |
+
assistantTranscriptSpan.textContent = "...";
|
713 |
+
audioPlayer.style.display = 'none';
|
714 |
+
audioPlayer.src = '';
|
715 |
+
assistantAudioBufferQueue = [];
|
716 |
+
if (sourceNode) { try {sourceNode.stop();} catch(e){} sourceNode = null; }
|
717 |
+
} catch (err) {
|
718 |
+
console.error('Error accessing microphone:', err);
|
719 |
+
statusDiv.textContent = 'Status: Error accessing microphone.';
|
720 |
+
alert('Could not access microphone: ' + err.message);
|
721 |
+
}
|
722 |
+
};
|
723 |
+
|
724 |
+
stopRecordButton.onclick = () => {
|
725 |
+
if (mediaRecorder && mediaRecorder.state === "recording") {
|
726 |
+
mediaRecorder.stop();
|
727 |
+
startRecordButton.disabled = false;
|
728 |
+
stopRecordButton.disabled = true;
|
729 |
+
}
|
730 |
+
};
|
731 |
+
|
732 |
+
connectWebSocket();
|
733 |
+
</script>
|
734 |
+
</body>
|
735 |
+
</html>
|
736 |
+
"""
|
737 |
+
return HTMLResponse(content=html_content)
|
738 |
+
|
739 |
+
if __name__ == "__main__":
|
740 |
+
uvicorn.run(app, host="0.0.0.0", port=8000, log_level="info")
|
parler-streaming.py
ADDED
@@ -0,0 +1,402 @@
1 |
+
import io
|
2 |
+
import math
|
3 |
+
from queue import Queue
|
4 |
+
from threading import Thread
|
5 |
+
from typing import Optional
|
6 |
+
|
7 |
+
import numpy as np
|
8 |
+
# import spaces
|
9 |
+
import gradio as gr
|
10 |
+
import torch
|
11 |
+
|
12 |
+
from parler_tts import ParlerTTSForConditionalGeneration
|
13 |
+
from pydub import AudioSegment
|
14 |
+
from transformers import AutoTokenizer, AutoFeatureExtractor, set_seed
|
15 |
+
from transformers.generation.streamers import BaseStreamer
|
16 |
+
|
17 |
+
device = "cuda:0" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
|
18 |
+
torch_dtype = torch.float16 if device != "cpu" else torch.float32
|
19 |
+
|
20 |
+
repo_id = "ai4bharat/indic-parler-tts"
|
21 |
+
jenny_repo_id = "ylacombe/parler-tts-mini-jenny-30H"
|
22 |
+
|
23 |
+
model = ParlerTTSForConditionalGeneration.from_pretrained(
|
24 |
+
repo_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
|
25 |
+
).to(device)
|
26 |
+
# jenny_model = ParlerTTSForConditionalGeneration.from_pretrained(
|
27 |
+
# jenny_repo_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
|
28 |
+
# ).to(device)
|
29 |
+
|
30 |
+
tokenizer = AutoTokenizer.from_pretrained(repo_id)
|
31 |
+
feature_extractor = AutoFeatureExtractor.from_pretrained(repo_id)
|
32 |
+
|
33 |
+
SAMPLE_RATE = feature_extractor.sampling_rate
|
34 |
+
SEED = 42
|
35 |
+
|
36 |
+
default_text = "Please surprise me and speak in whatever voice you enjoy."
|
37 |
+
examples = [
|
38 |
+
[
|
39 |
+
"Remember - this is only the first iteration of the model! To improve the prosody and naturalness of the speech further, we're scaling up the amount of training data by a factor of five times.",
|
40 |
+
"A male speaker with a low-pitched voice delivering his words at a fast pace in a small, confined space with a very clear audio and an animated tone.",
|
41 |
+
3.0,
|
42 |
+
],
|
43 |
+
[
|
44 |
+
"'This is the best time of my life, Bartley,' she said happily.",
|
45 |
+
"A female speaker with a slightly low-pitched, quite monotone voice delivers her words at a slightly faster-than-average pace in a confined space with very clear audio.",
|
46 |
+
3.0,
|
47 |
+
],
|
48 |
+
[
|
49 |
+
"Montrose also, after having experienced still more variety of good and bad fortune, threw down his arms, and retired out of the kingdom.",
|
50 |
+
"A male speaker with a slightly high-pitched voice delivering his words at a slightly slow pace in a small, confined space with a touch of background noise and a quite monotone tone.",
|
51 |
+
3.0,
|
52 |
+
],
|
53 |
+
[
|
54 |
+
"Montrose also, after having experienced still more variety of good and bad fortune, threw down his arms, and retired out of the kingdom.",
|
55 |
+
"A male speaker with a low-pitched voice delivers his words at a fast pace and an animated tone, in a very spacious environment, accompanied by noticeable background noise.",
|
56 |
+
3.0,
|
57 |
+
],
|
58 |
+
]
|
59 |
+
|
60 |
+
jenny_examples = [
|
61 |
+
[
|
62 |
+
"Remember, this is only the first iteration of the model! To improve the prosody and naturalness of the speech further, we're scaling up the amount of training data by a factor of five times.",
|
63 |
+
"Jenny speaks at an average pace with a slightly animated delivery in a very confined sounding environment with clear audio quality.",
|
64 |
+
3.0,
|
65 |
+
],
|
66 |
+
[
|
67 |
+
"'This is the best time of my life, Bartley,' she said happily.",
|
68 |
+
"Jenny speaks in quite a monotone voice at a slightly faster-than-average pace in a confined space with very clear audio.",
|
69 |
+
3.0,
|
70 |
+
],
|
71 |
+
[
|
72 |
+
"Montrose also, after having experienced still more variety of good and bad fortune, threw down his arms, and retired out of the kingdom.",
|
73 |
+
"Jenny delivers her words at a slightly slow pace in a small, confined space with a touch of background noise and a quite monotone tone.",
|
74 |
+
3.0,
|
75 |
+
],
|
76 |
+
[
|
77 |
+
"Montrose also, after having experienced still more variety of good and bad fortune, threw down his arms, and retired out of the kingdom.",
|
78 |
+
"Jenny delivers her words at a fast pace and an animated tone, in a very spacious environment, accompanied by noticeable background noise.",
|
79 |
+
3.0,
|
80 |
+
],
|
81 |
+
]
|
82 |
+
|
83 |
+
|
84 |
+
class ParlerTTSStreamer(BaseStreamer):
|
85 |
+
def __init__(
|
86 |
+
self,
|
87 |
+
model: ParlerTTSForConditionalGeneration,
|
88 |
+
device: Optional[str] = None,
|
89 |
+
play_steps: Optional[int] = 10,
|
90 |
+
stride: Optional[int] = None,
|
91 |
+
timeout: Optional[float] = None,
|
92 |
+
):
|
93 |
+
"""
|
94 |
+
Streamer that stores playback-ready audio in a queue, to be used by a downstream application as an iterator. This is
|
95 |
+
useful for applications that benefit from accessing the generated audio in a non-blocking way (e.g. in an interactive
|
96 |
+
Gradio demo).
|
97 |
+
Parameters:
|
98 |
+
model (`ParlerTTSForConditionalGeneration`):
|
99 |
+
The Parler-TTS model used to generate the audio waveform.
|
100 |
+
device (`str`, *optional*):
|
101 |
+
The torch device on which to run the computation. If `None`, will default to the device of the model.
|
102 |
+
play_steps (`int`, *optional*, defaults to 10):
|
103 |
+
The number of generation steps with which to return the generated audio array. Using fewer steps will
|
104 |
+
mean the first chunk is ready faster, but will require more codec decoding steps overall. This value
|
105 |
+
should be tuned to your device and latency requirements.
|
106 |
+
stride (`int`, *optional*):
|
107 |
+
The window (stride) between adjacent audio samples. Using a stride between adjacent audio samples reduces
|
108 |
+
the hard boundary between them, giving smoother playback. If `None`, will default to a value equivalent to
|
109 |
+
play_steps // 6 in the audio space.
|
110 |
+
timeout (`int`, *optional*):
|
111 |
+
The timeout for the audio queue. If `None`, the queue will block indefinitely. Useful to handle exceptions
|
112 |
+
in `.generate()`, when it is called in a separate thread.
|
113 |
+
"""
|
114 |
+
self.decoder = model.decoder
|
115 |
+
self.audio_encoder = model.audio_encoder
|
116 |
+
self.generation_config = model.generation_config
|
117 |
+
self.device = device if device is not None else model.device
|
118 |
+
|
119 |
+
# variables used in the streaming process
|
120 |
+
self.play_steps = play_steps
|
121 |
+
if stride is not None:
|
122 |
+
self.stride = stride
|
123 |
+
else:
|
124 |
+
hop_length = math.floor(self.audio_encoder.config.sampling_rate / self.audio_encoder.config.frame_rate)
|
125 |
+
self.stride = hop_length * (play_steps - self.decoder.num_codebooks) // 6
|
126 |
+
self.token_cache = None
|
127 |
+
self.to_yield = 0
|
128 |
+
|
129 |
+
# variables used in the thread process
|
130 |
+
self.audio_queue = Queue()
|
131 |
+
self.stop_signal = None
|
132 |
+
self.timeout = timeout
|
133 |
+
|
134 |
+
def apply_delay_pattern_mask(self, input_ids):
|
135 |
+
# build the delay pattern mask for offsetting each codebook prediction by 1 (this behaviour is specific to Parler)
|
136 |
+
_, delay_pattern_mask = self.decoder.build_delay_pattern_mask(
|
137 |
+
input_ids[:, :1],
|
138 |
+
bos_token_id=self.generation_config.bos_token_id,
|
139 |
+
pad_token_id=self.generation_config.decoder_start_token_id,
|
140 |
+
max_length=input_ids.shape[-1],
|
141 |
+
)
|
142 |
+
# apply the pattern mask to the input ids
|
143 |
+
input_ids = self.decoder.apply_delay_pattern_mask(input_ids, delay_pattern_mask)
|
144 |
+
|
145 |
+
# revert the pattern delay mask by filtering the pad token id
|
146 |
+
mask = (delay_pattern_mask != self.generation_config.bos_token_id) & (delay_pattern_mask != self.generation_config.pad_token_id)
|
147 |
+
input_ids = input_ids[mask].reshape(1, self.decoder.num_codebooks, -1)
|
148 |
+
# append the frame dimension back to the audio codes
|
149 |
+
input_ids = input_ids[None, ...]
|
150 |
+
|
151 |
+
# send the input_ids to the correct device
|
152 |
+
input_ids = input_ids.to(self.audio_encoder.device)
|
153 |
+
|
154 |
+
decode_sequentially = (
|
155 |
+
self.generation_config.bos_token_id in input_ids
|
156 |
+
or self.generation_config.pad_token_id in input_ids
|
157 |
+
or self.generation_config.eos_token_id in input_ids
|
158 |
+
)
|
159 |
+
if not decode_sequentially:
|
160 |
+
output_values = self.audio_encoder.decode(
|
161 |
+
input_ids,
|
162 |
+
audio_scales=[None],
|
163 |
+
)
|
164 |
+
else:
|
165 |
+
sample = input_ids[:, 0]
|
166 |
+
sample_mask = (sample >= self.audio_encoder.config.codebook_size).sum(dim=(0, 1)) == 0
|
167 |
+
sample = sample[:, :, sample_mask]
|
168 |
+
output_values = self.audio_encoder.decode(sample[None, ...], [None])
|
169 |
+
|
170 |
+
audio_values = output_values.audio_values[0, 0]
|
171 |
+
return audio_values.cpu().float().numpy()
|
172 |
+
|
173 |
+
def put(self, value):
|
174 |
+
batch_size = value.shape[0] // self.decoder.num_codebooks
|
175 |
+
if batch_size > 1:
|
176 |
+
raise ValueError("ParlerTTSStreamer only supports batch size 1")
|
177 |
+
|
178 |
+
if self.token_cache is None:
|
179 |
+
self.token_cache = value
|
180 |
+
else:
|
181 |
+
self.token_cache = torch.concatenate([self.token_cache, value[:, None]], dim=-1)
|
182 |
+
|
183 |
+
if self.token_cache.shape[-1] % self.play_steps == 0:
|
184 |
+
audio_values = self.apply_delay_pattern_mask(self.token_cache)
|
185 |
+
self.on_finalized_audio(audio_values[self.to_yield : -self.stride])
|
186 |
+
self.to_yield += len(audio_values) - self.to_yield - self.stride
|
187 |
+
|
188 |
+
def end(self):
|
189 |
+
"""Flushes any remaining cache and appends the stop symbol."""
|
190 |
+
if self.token_cache is not None:
|
191 |
+
audio_values = self.apply_delay_pattern_mask(self.token_cache)
|
192 |
+
else:
|
193 |
+
audio_values = np.zeros(self.to_yield)
|
194 |
+
|
195 |
+
self.on_finalized_audio(audio_values[self.to_yield :], stream_end=True)
|
196 |
+
|
197 |
+
def on_finalized_audio(self, audio: np.ndarray, stream_end: bool = False):
|
198 |
+
"""Put the new audio in the queue. If the stream is ending, also put a stop signal in the queue."""
|
199 |
+
self.audio_queue.put(audio, timeout=self.timeout)
|
200 |
+
if stream_end:
|
201 |
+
self.audio_queue.put(self.stop_signal, timeout=self.timeout)
|
202 |
+
|
203 |
+
def __iter__(self):
|
204 |
+
return self
|
205 |
+
|
206 |
+
def __next__(self):
|
207 |
+
value = self.audio_queue.get(timeout=self.timeout)
|
208 |
+
if not isinstance(value, np.ndarray) and value == self.stop_signal:
|
209 |
+
raise StopIteration()
|
210 |
+
else:
|
211 |
+
return value
|
212 |
+
|
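# --- Editor's illustrative sketch (hedged; not part of the original file) ---------------
# Shows how the streamer above is meant to be consumed, mirroring generate_base() below
# but collecting the chunks into a WAV file instead of yielding MP3 bytes. The output
# path and the ~2-second chunk interval are arbitrary choices; soundfile is listed in
# requirements.txt. This helper is never called by the demo itself.
def _example_stream_to_wav(text: str, description: str, out_path: str = "out.wav"):
    import soundfile as sf  # local import, only needed for this example

    play_steps = int(model.audio_encoder.config.frame_rate * 2.0)  # ~2 s of audio per chunk
    streamer = ParlerTTSStreamer(model, device=device, play_steps=play_steps)
    generation_kwargs = dict(
        input_ids=tokenizer(description, return_tensors="pt").to(device).input_ids,
        prompt_input_ids=tokenizer(text, return_tensors="pt").to(device).input_ids,
        streamer=streamer,
        do_sample=True,
        min_new_tokens=10,
    )
    Thread(target=model.generate, kwargs=generation_kwargs).start()  # generate in background
    chunks = [chunk for chunk in streamer]  # iterating blocks until the stop signal arrives
    sf.write(out_path, np.concatenate(chunks), model.audio_encoder.config.sampling_rate)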
213 |
+
def numpy_to_mp3(audio_array, sampling_rate):
|
214 |
+
# Normalize audio_array if it's floating-point
|
215 |
+
if np.issubdtype(audio_array.dtype, np.floating):
|
216 |
+
max_val = np.max(np.abs(audio_array))
|
217 |
+
audio_array = (audio_array / max_val) * 32767 # Normalize to 16-bit range
|
218 |
+
audio_array = audio_array.astype(np.int16)
|
219 |
+
|
220 |
+
# Create an audio segment from the numpy array
|
221 |
+
audio_segment = AudioSegment(
|
222 |
+
audio_array.tobytes(),
|
223 |
+
frame_rate=sampling_rate,
|
224 |
+
sample_width=audio_array.dtype.itemsize,
|
225 |
+
channels=1
|
226 |
+
)
|
227 |
+
|
228 |
+
# Export the audio segment to MP3 bytes - use a high bitrate to maximise quality
|
229 |
+
mp3_io = io.BytesIO()
|
230 |
+
audio_segment.export(mp3_io, format="mp3", bitrate="320k")
|
231 |
+
|
232 |
+
# Get the MP3 bytes
|
233 |
+
mp3_bytes = mp3_io.getvalue()
|
234 |
+
mp3_io.close()
|
235 |
+
|
236 |
+
return mp3_bytes
|
237 |
+
|
238 |
+
sampling_rate = model.audio_encoder.config.sampling_rate
|
239 |
+
frame_rate = model.audio_encoder.config.frame_rate
|
240 |
+
|
241 |
+
# @spaces.GPU
|
242 |
+
def generate_base(text, description, play_steps_in_s=2.0):
|
243 |
+
play_steps = int(frame_rate * play_steps_in_s)
|
244 |
+
streamer = ParlerTTSStreamer(model, device=device, play_steps=play_steps)
|
245 |
+
|
246 |
+
inputs = tokenizer(description, return_tensors="pt").to(device)
|
247 |
+
prompt = tokenizer(text, return_tensors="pt").to(device)
|
248 |
+
|
249 |
+
generation_kwargs = dict(
|
250 |
+
input_ids=inputs.input_ids,
|
251 |
+
prompt_input_ids=prompt.input_ids,
|
252 |
+
streamer=streamer,
|
253 |
+
do_sample=True,
|
254 |
+
temperature=1.0,
|
255 |
+
min_new_tokens=10,
|
256 |
+
)
|
257 |
+
|
258 |
+
set_seed(SEED)
|
259 |
+
thread = Thread(target=model.generate, kwargs=generation_kwargs)
|
260 |
+
thread.start()
|
261 |
+
|
262 |
+
for new_audio in streamer:
|
263 |
+
print(f"Sample of length: {round(new_audio.shape[0] / sampling_rate, 2)} seconds")
|
264 |
+
yield numpy_to_mp3(new_audio, sampling_rate=sampling_rate)
|
265 |
+
|
266 |
+
# @spaces.GPU
|
267 |
+
def generate_jenny(text, description, play_steps_in_s=2.0):
|
268 |
+
play_steps = int(frame_rate * play_steps_in_s)
|
269 |
+
streamer = ParlerTTSStreamer(model, device=device, play_steps=play_steps)
|
270 |
+
|
271 |
+
inputs = tokenizer(description, return_tensors="pt").to(device)
|
272 |
+
prompt = tokenizer(text, return_tensors="pt").to(device)
|
273 |
+
|
274 |
+
generation_kwargs = dict(
|
275 |
+
input_ids=inputs.input_ids,
|
276 |
+
prompt_input_ids=prompt.input_ids,
|
277 |
+
streamer=streamer,
|
278 |
+
do_sample=True,
|
279 |
+
temperature=1.0,
|
280 |
+
min_new_tokens=10,
|
281 |
+
)
|
282 |
+
|
283 |
+
set_seed(SEED)
|
284 |
+
thread = Thread(target=model.generate, kwargs=generation_kwargs)  # jenny_model is commented out above, so fall back to the loaded base model (matches the streamer above)
|
285 |
+
thread.start()
|
286 |
+
|
287 |
+
for new_audio in streamer:
|
288 |
+
print(f"Sample of length: {round(new_audio.shape[0] / sampling_rate, 2)} seconds")
|
289 |
+
yield sampling_rate, new_audio
|
290 |
+
|
291 |
+
|
292 |
+
css = """
|
293 |
+
#share-btn-container {
|
294 |
+
display: flex;
|
295 |
+
padding-left: 0.5rem !important;
|
296 |
+
padding-right: 0.5rem !important;
|
297 |
+
background-color: #000000;
|
298 |
+
justify-content: center;
|
299 |
+
align-items: center;
|
300 |
+
border-radius: 9999px !important;
|
301 |
+
width: 13rem;
|
302 |
+
margin-top: 10px;
|
303 |
+
margin-left: auto;
|
304 |
+
flex: unset !important;
|
305 |
+
}
|
306 |
+
#share-btn {
|
307 |
+
all: initial;
|
308 |
+
color: #ffffff;
|
309 |
+
font-weight: 600;
|
310 |
+
cursor: pointer;
|
311 |
+
font-family: 'IBM Plex Sans', sans-serif;
|
312 |
+
margin-left: 0.5rem !important;
|
313 |
+
padding-top: 0.25rem !important;
|
314 |
+
padding-bottom: 0.25rem !important;
|
315 |
+
right:0;
|
316 |
+
}
|
317 |
+
#share-btn * {
|
318 |
+
all: unset !important;
|
319 |
+
}
|
320 |
+
#share-btn-container div:nth-child(-n+2){
|
321 |
+
width: auto !important;
|
322 |
+
min-height: 0px !important;
|
323 |
+
}
|
324 |
+
#share-btn-container .wrap {
|
325 |
+
display: none !important;
|
326 |
+
}
|
327 |
+
"""
|
328 |
+
with gr.Blocks(css=css) as block:
|
329 |
+
gr.HTML(
|
330 |
+
"""
|
331 |
+
<div style="text-align: center; max-width: 700px; margin: 0 auto;">
|
332 |
+
<div
|
333 |
+
style="
|
334 |
+
display: inline-flex; align-items: center; gap: 0.8rem; font-size: 1.75rem;
|
335 |
+
"
|
336 |
+
>
|
337 |
+
<h1 style="font-weight: 900; margin-bottom: 7px; line-height: normal;">
|
338 |
+
Parler-TTS 🗣️
|
339 |
+
</h1>
|
340 |
+
</div>
|
341 |
+
</div>
|
342 |
+
"""
|
343 |
+
)
|
344 |
+
gr.HTML(
|
345 |
+
f"""
|
346 |
+
<p><a href="https://github.com/huggingface/parler-tts"> Parler-TTS</a> is a training and inference library for
|
347 |
+
high-fidelity text-to-speech (TTS) models. Two models are demonstrated here: <a href="https://huggingface.co/parler-tts/parler_tts_mini_v0.1"> Parler-TTS Mini v0.1</a>,
|
348 |
+
the first iteration of the model, trained on 10k hours of narrated audiobooks, and <a href="https://huggingface.co/ylacombe/parler-tts-mini-jenny-30H"> Parler-TTS Jenny</a>,
|
349 |
+
a model fine-tuned on the <a href="https://huggingface.co/datasets/reach-vb/jenny_tts_dataset"> Jenny dataset</a>.
|
350 |
+
Both models generate high-quality speech with features that can be controlled using a simple text prompt (e.g. gender, background noise, speaking rate, pitch and reverberation).</p>
|
351 |
+
<p>Tips for ensuring good generation:
|
352 |
+
<ul>
|
353 |
+
<li>Include the term <b>"very clear audio"</b> to generate the highest quality audio, and "very noisy audio" for high levels of background noise</li>
|
354 |
+
<li>When using the fine-tuned model, include the term <b>"Jenny"</b> to pick out her voice</li>
|
355 |
+
<li>Punctuation can be used to control the prosody of the generations, e.g. use commas to add small breaks in speech</li>
|
356 |
+
<li>The remaining speech features (gender, speaking rate, pitch and reverberation) can be controlled directly through the prompt</li>
|
357 |
+
</ul>
|
358 |
+
</p>
|
359 |
+
"""
|
360 |
+
)
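# Editor's illustrative note (hedged; not part of the original app): the tips above translate
# into description prompts like the one below, which generate_base() consumes directly.
# The exact wording is an assumption modelled on the `examples` list defined earlier.
example_description = (
    "A female speaker delivers her words at a moderate pace "
    "in a confined space with very clear audio."
)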
|
361 |
+
with gr.Tab("Base"):
|
362 |
+
with gr.Row():
|
363 |
+
with gr.Column():
|
364 |
+
input_text = gr.Textbox(label="Input Text", lines=2, value=default_text, elem_id="input_text")
|
365 |
+
description = gr.Textbox(label="Description", lines=2, value="", elem_id="input_description")
|
366 |
+
play_seconds = gr.Slider(3.0, 7.0, value=3.0, step=2, label="Streaming interval in seconds", info="Lower = shorter chunks, lower latency, more codec steps")
|
367 |
+
run_button = gr.Button("Generate Audio", variant="primary")
|
368 |
+
with gr.Column():
|
369 |
+
audio_out = gr.Audio(label="Parler-TTS generation", format="mp3", elem_id="audio_out", streaming=True, autoplay=True)
|
370 |
+
|
371 |
+
inputs = [input_text, description, play_seconds]
|
372 |
+
outputs = [audio_out]
|
373 |
+
gr.Examples(examples=examples, fn=generate_base, inputs=inputs, outputs=outputs, cache_examples=False)
|
374 |
+
run_button.click(fn=generate_base, inputs=inputs, outputs=outputs, queue=True)
|
375 |
+
|
376 |
+
with gr.Tab("Jenny"):
|
377 |
+
with gr.Row():
|
378 |
+
with gr.Column():
|
379 |
+
input_text = gr.Textbox(label="Input Text", lines=2, value=jenny_examples[0][0], elem_id="input_text")
|
380 |
+
description = gr.Textbox(label="Description", lines=2, value=jenny_examples[0][1], elem_id="input_description")
|
381 |
+
play_seconds = gr.Slider(3.0, 7.0, value=jenny_examples[0][2], step=2, label="Streaming interval in seconds", info="Lower = shorter chunks, lower latency, more codec steps")
|
382 |
+
run_button = gr.Button("Generate Audio", variant="primary")
|
383 |
+
with gr.Column():
|
384 |
+
audio_out = gr.Audio(label="Parler-TTS generation", format="mp3", elem_id="audio_out", streaming=True, autoplay=True)
|
385 |
+
|
386 |
+
inputs = [input_text, description, play_seconds]
|
387 |
+
outputs = [audio_out]
|
388 |
+
gr.Examples(examples=jenny_examples, fn=generate_jenny, inputs=inputs, outputs=outputs, cache_examples=False)
|
389 |
+
run_button.click(fn=generate_jenny, inputs=inputs, outputs=outputs, queue=True)
|
390 |
+
|
391 |
+
gr.HTML(
|
392 |
+
"""
|
393 |
+
<p>To improve the prosody and naturalness of the speech further, we're scaling up the amount of training data to 50k hours of speech.
|
394 |
+
The v1 release of the model will be trained on this data and will add inference optimisations, such as flash attention
|
395 |
+
and torch compile, that should improve the latency by 2-4x. If you want to find out more about how this model was trained, or even fine-tune it yourself, check out the
|
396 |
+
<a href="https://github.com/huggingface/parler-tts"> Parler-TTS</a> repository on GitHub. The Parler-TTS codebase and its
|
397 |
+
associated checkpoints are licensed under <a href='https://github.com/huggingface/parler-tts?tab=Apache-2.0-1-ov-file#readme'> Apache 2.0</a>.</p>
|
398 |
+
"""
|
399 |
+
)
|
400 |
+
|
401 |
+
block.queue()
|
402 |
+
block.launch(share=True)
|
requirements.txt
ADDED
@@ -0,0 +1,22 @@
1 |
+
# requirements.txt
|
2 |
+
fastapi
|
3 |
+
uvicorn[standard]
|
4 |
+
websockets
|
5 |
+
openai-whisper
|
6 |
+
torch
|
7 |
+
torchaudio
|
8 |
+
transformers
|
9 |
+
accelerate # Often useful for transformers
|
10 |
+
python-multipart # For file uploads in traditional endpoints
|
11 |
+
soundfile # For handling audio files
|
12 |
+
librosa
|
13 |
+
parler-tts # For AI4Bharat's IndicParler-TTS
|
14 |
+
onnx
|
15 |
+
onnxruntime
|
16 |
+
# For specific hardware acceleration (optional, choose based on your setup)
|
17 |
+
# bitsandbytes # For 8-bit quantization of LLM (further RAM reduction)
|
18 |
+
# sentencepiece # Often a dependency for tokenizers
|
19 |
+
# For demo purposes
|
20 |
+
gradio
|
21 |
+
# For Gemini integration
|
22 |
+
google-generativeai
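# Example setup (editor's illustrative note; commands assume the files added in this commit):
#   pip install -r requirements.txt
#   python main.py              # FastAPI backend with the browser demo on http://localhost:8000
#   python parler-streaming.py  # standalone Gradio demo for streaming Parler-TTS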
|
streaming_nb.ipynb
ADDED
File without changes
|
test_notebook.ipynb
ADDED
The diff for this file is too large to render.
See raw diff
|
|