# Building Conversational Chatbots with Gradio

Tags: AUDIO, STREAMING, CHATBOTS

## Introduction

The next generation of AI user interfaces is moving towards audio-native experiences. Users will be able to speak to chatbots and receive spoken responses in return. Several models have been built under this paradigm, including GPT-4o and [mini omni](https://github.com/gpt-omni/mini-omni).

In this guide, we'll walk you through building your own conversational chat application using mini omni as an example. You can see a demo of the finished app below:

<video src="https://github.com/user-attachments/assets/db36f4db-7535-49f1-a2dd-bd36c487ebdf" controls
height="600" width="600" style="display: block; margin: auto;" autoplay="true" loop="true">
</video>
## Application Overview

Our application will enable the following user experience:

1. Users click a button to start recording their message
2. The app detects when the user has finished speaking and stops recording
3. The user's audio is passed to the omni model, which streams back a response
4. After mini omni finishes speaking, the user's microphone is reactivated
5. All previous spoken audio, from both the user and omni, is displayed in a chatbot component

Let's dive into the implementation details.
## Processing User Audio

We'll stream the user's audio from their microphone to the server and determine if the user has stopped speaking on each new chunk of audio.

Here's our `process_audio` function:

```python
import gradio as gr
import numpy as np
from utils import determine_pause

def process_audio(audio: tuple, state: AppState):
    if state.stream is None:
        state.stream = audio[1]
        state.sampling_rate = audio[0]
    else:
        state.stream = np.concatenate((state.stream, audio[1]))

    pause_detected = determine_pause(state.stream, state.sampling_rate, state)
    state.pause_detected = pause_detected

    if state.pause_detected and state.started_talking:
        return gr.Audio(recording=False), state
    return None, state
```
This function takes two inputs:

1. The current audio chunk (a tuple of `(sampling_rate, numpy array of audio)`)
2. The current application state
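
For concreteness, each chunk arrives in Gradio's numpy audio format. The values below are purely illustrative (a hypothetical half second of silence at 48 kHz); the actual sampling rate and samples depend on the user's microphone:

```python
import numpy as np

# A hypothetical chunk: 0.5 seconds of 16-bit mono audio at 48 kHz.
# Gradio's Audio component with type="numpy" delivers (sampling_rate, samples).
sampling_rate = 48000
samples = np.zeros(24000, dtype=np.int16)  # silence, for illustration only
audio_chunk = (sampling_rate, samples)
```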
We'll use the following `AppState` dataclass to manage our application state:

```python
from dataclasses import dataclass, field

@dataclass
class AppState:
    stream: np.ndarray | None = None
    sampling_rate: int = 0
    pause_detected: bool = False
    started_talking: bool = False
    stopped: bool = False
    # Mutable defaults must be created with field(default_factory=...)
    conversation: list = field(default_factory=list)
```
The function concatenates new audio chunks to the existing stream and checks if the user has stopped speaking. If a pause is detected, it returns an update to stop recording. Otherwise, it returns `None` to indicate no changes.

The implementation of the `determine_pause` function is specific to the omni-mini project and can be found [here](https://huggingface.co/spaces/gradio/omni-mini/blob/eb027808c7bfe5179b46d9352e3fa1813a45f7c3/app.py#L98).
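
To illustrate what such a check might look like, here is a minimal sketch of an energy-based pause detector. This is an assumption-laden placeholder, not the omni-mini implementation: the one-second window, the RMS threshold, and the mono int16 audio are all illustrative choices you would tune for your own app:

```python
def determine_pause(stream: np.ndarray, sampling_rate: int, state: AppState) -> bool:
    """Illustrative energy-based pause detector (not the omni-mini version)."""
    # Look only at the most recent second of (mono, int16) audio.
    window = stream[-sampling_rate:]

    # Normalize to [-1, 1] and compute RMS energy.
    rms = np.sqrt(np.mean((window.astype(np.float32) / 32768.0) ** 2))
    speech_threshold = 0.01  # assumed value; tune for your microphone

    if rms > speech_threshold:
        state.started_talking = True
        return False

    # Only report a pause once the user has actually started talking.
    return state.started_talking
```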
## Generating the Response

After processing the user's audio, we need to generate and stream the chatbot's response. Here's our `response` function:

```python
import io
import tempfile
from pydub import AudioSegment

def response(state: AppState):
    if not state.pause_detected and not state.started_talking:
        return None, AppState()

    audio_buffer = io.BytesIO()

    segment = AudioSegment(
        state.stream.tobytes(),
        frame_rate=state.sampling_rate,
        sample_width=state.stream.dtype.itemsize,
        channels=(1 if len(state.stream.shape) == 1 else state.stream.shape[1]),
    )
    segment.export(audio_buffer, format="wav")

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(audio_buffer.getvalue())

    state.conversation.append({"role": "user",
                               "content": {"path": f.name,
                                           "mime_type": "audio/wav"}})

    output_buffer = b""

    for mp3_bytes in speaking(audio_buffer.getvalue()):
        output_buffer += mp3_bytes
        yield mp3_bytes, state

    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
        f.write(output_buffer)

    state.conversation.append({"role": "assistant",
                               "content": {"path": f.name,
                                           "mime_type": "audio/mp3"}})

    yield None, AppState(conversation=state.conversation)
```
This function:

1. Converts the user's audio to a WAV file
2. Adds the user's message to the conversation history
3. Generates and streams the chatbot's response using the `speaking` function
4. Saves the chatbot's response as an MP3 file
5. Adds the chatbot's response to the conversation history

Note: The implementation of the `speaking` function is specific to the omni-mini project and can be found [here](https://huggingface.co/spaces/gradio/omni-mini/blob/main/app.py#L116).
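
What matters for the Gradio app is only `speaking`'s contract: it takes the user's turn as WAV bytes and yields the reply as a stream of MP3 chunks. Here is a hypothetical stand-in (the `reply.mp3` path is an assumption) that streams a pre-recorded file so you can exercise the rest of the app without the model:

```python
def speaking(wav_bytes: bytes):
    """Illustrative stand-in for omni-mini's speaking() generator.

    The real function runs mini omni on the user's audio; this placeholder
    simply streams an existing MP3 file in 32 KB pieces.
    """
    chunk_size = 32 * 1024
    with open("reply.mp3", "rb") as f:  # assumed placeholder file
        while chunk := f.read(chunk_size):
            yield chunk
```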
## Building the Gradio App

Now let's put it all together using Gradio's Blocks API:

```python
import gradio as gr

def start_recording_user(state: AppState):
    if not state.stopped:
        return gr.Audio(recording=True)

with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            input_audio = gr.Audio(
                label="Input Audio", sources="microphone", type="numpy"
            )
        with gr.Column():
            chatbot = gr.Chatbot(label="Conversation", type="messages")
            output_audio = gr.Audio(label="Output Audio", streaming=True, autoplay=True)
    state = gr.State(value=AppState())

    stream = input_audio.stream(
        process_audio,
        [input_audio, state],
        [input_audio, state],
        stream_every=0.5,
        time_limit=30,
    )
    respond = input_audio.stop_recording(
        response,
        [state],
        [output_audio, state]
    )
    respond.then(lambda s: s.conversation, [state], [chatbot])

    restart = output_audio.stop(
        start_recording_user,
        [state],
        [input_audio]
    )

    cancel = gr.Button("Stop Conversation", variant="stop")
    cancel.click(lambda: (AppState(stopped=True), gr.Audio(recording=False)), None,
                 [state, input_audio], cancels=[respond, restart])

if __name__ == "__main__":
    demo.launch()
```
This setup creates a user interface with:

- An input audio component for recording user messages
- A chatbot component to display the conversation history
- An output audio component for the chatbot's responses
- A button to stop and reset the conversation

The app streams user audio in 0.5-second chunks, processes it, generates responses, and updates the conversation history accordingly.
## Conclusion

This guide demonstrates how to build a conversational chatbot application using Gradio and the mini omni model. You can adapt this framework to create various audio-based chatbot demos. To see the full application in action, visit the Hugging Face Spaces demo: https://huggingface.co/spaces/gradio/omni-mini

Feel free to experiment with different models, audio processing techniques, or user interface designs to create your own unique conversational AI experiences!