Some modifications to the Gradio interface

#10
by pkanda - opened

I made some modifications to AgentF5TTSChunk.py:

  1. Improved Error Handling

    Original: Did not check if model_path or output_audio_folder existed before processing.
    Improved: Now verifies that model_path exists before initializing the model, and that output_audio_folder exists before proceeding. If they don’t exist, it logs an error and prevents execution to avoid crashes.

  2. Output Audio Handling

    Original: Expected a file path for output audio.
    Improved: Now asks for an output folder instead of a specific file, dynamically creating "generated_audio.wav" inside the chosen folder.

  3. File Validation Before Returning Output

    Original: Directly returned the generated file path.
    Improved: Now checks if the generated file exists before returning it to Gradio. If it doesn’t exist, logs an error and returns None.

  4. More Robust Logging

    Original: Limited logging for errors.
    Improved: Added logging for missing files and incorrect paths to help with debugging.

  5. Gradio Input Adjustments

    Original: Accepted a file path for the model as a string.
    Improved: Now uses gr.File for model_path and gr.Textbox for output_audio_folder to ensure the correct types are received.

script:

import os
import re
import time
import logging
import json
import subprocess
import gradio as gr
from f5_tts.api import F5TTS

# Constants

CONFIG_FILE = "last_inputs.json"

# Initialize logging

logging.basicConfig(level=logging.INFO)

class AgentF5TTS:
    def __init__(self, ckpt_file, vocoder_name="vocos", delay=0, device="cuda"):
        self.model = F5TTS(ckpt_file=ckpt_file, vocoder_name=vocoder_name, device=device)
        self.delay = delay
    def generate_emotion_speech(self, text, output_audio_folder, speaker_emotion_refs, convert_to_mp3=False):
        lines = [line.strip() for line in text.split("\n") if line.strip()]

        if not lines:
            logging.error("Input text is empty.")
            return None

        if not output_audio_folder:
            logging.error("Output audio folder is not specified.")
            return None
        if not os.path.exists(output_audio_folder):
            os.makedirs(output_audio_folder, exist_ok=True)
        output_audio_file = os.path.join(output_audio_folder, "generated_audio.wav")
        temp_files = []

        for i, line in enumerate(lines):
            speaker, emotion = self._determine_speaker_emotion(line)
            ref_audio = speaker_emotion_refs.get((speaker, emotion))
            line = re.sub(r'\[speaker:.*?\]\s*', '', line)
            if not ref_audio or not os.path.exists(ref_audio):
                logging.error(f"Reference audio not found for speaker '{speaker}', emotion '{emotion}'.")
                continue

            temp_file = os.path.join(output_audio_folder, f"line_{i + 1}.wav")

            try:
                logging.info(f"Generating speech for line {i + 1}: '{line}' with speaker '{speaker}', emotion '{emotion}'")
                self.model.infer(
                    ref_file=ref_audio,
                    ref_text="",  # Placeholder or load corresponding text
                    gen_text=line,
                    file_wave=temp_file,
                    remove_silence=True
                )
                temp_files.append(temp_file)
                time.sleep(self.delay)
            except Exception as e:
                logging.error(f"Error generating speech for line {i + 1}: {e}")

        self.combine_audio_files(temp_files, output_audio_file, convert_to_mp3)
        return output_audio_file

    def _determine_speaker_emotion(self, text):
        speaker, emotion = "speaker1", "neutral"
        match = re.search(r"\[speaker:(.*?), emotion:(.*?)\]", text)
        if match:
            speaker = match.group(1).strip()
            emotion = match.group(2).strip()
        return speaker, emotion

    def combine_audio_files(self, temp_files, output_audio_file, convert_to_mp3):
        if not temp_files:
            logging.error("No audio files to combine.")
            return

        list_file = os.path.join(os.path.dirname(output_audio_file), "file_list.txt")
        with open(list_file, "w") as f:
            for temp in temp_files:
                f.write(f"file '{temp}'\n")

        try:
            subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", output_audio_file], check=True)
            if convert_to_mp3:
                mp3_output = output_audio_file.replace(".wav", ".mp3")
                subprocess.run(["ffmpeg", "-y", "-i", output_audio_file, "-codec:a", "libmp3lame", "-qscale:a", "2", mp3_output], check=True)
                logging.info(f"Converted to MP3: {mp3_output}")
            for temp in temp_files:
                os.remove(temp)
            os.remove(list_file)
        except Exception as e:
            logging.error(f"Error combining audio files: {e}")

# Load last inputs from JSON file

def load_last_inputs():
    if os.path.exists(CONFIG_FILE):
        with open(CONFIG_FILE, "r") as f:
            return json.load(f)
    return {}

# Save inputs to JSON file

def save_last_inputs(inputs):
    with open(CONFIG_FILE, "w") as f:
        json.dump(inputs, f, indent=4)

# Gradio Interface

def gradio_interface(model_path, vocoder_name, delay, device, text, output_audio_folder, ref_audio, convert_to_mp3, speaker1_happy, speaker1_sad, speaker1_angry, speaker1_neutral):
    if not os.path.exists(model_path.name):
        logging.error(f"Model path does not exist: {model_path.name}")
        return None
    agent = AgentF5TTS(ckpt_file=model_path.name, vocoder_name=vocoder_name, delay=delay, device=device)
    speaker_emotion_refs = {
        ("speaker1", "happy"): speaker1_happy.name,
        ("speaker1", "sad"): speaker1_sad.name,
        ("speaker1", "angry"): speaker1_angry.name,
        ("speaker1", "neutral"): speaker1_neutral.name,
    }
    if not os.path.exists(output_audio_folder):
        logging.error(f"Output audio folder does not exist: {output_audio_folder}")
        return None
    output_file = agent.generate_emotion_speech(text, output_audio_folder, speaker_emotion_refs, convert_to_mp3)
    # Guard against None: generate_emotion_speech returns None on invalid input
    return output_file if output_file and os.path.exists(output_file) else None

# Launch Gradio App

iface = gr.Interface(
    fn=gradio_interface,
    inputs=[
        gr.File(label="Model Path"),
        gr.Dropdown(label="Vocoder", choices=["vocos", "bigvgan"]),
        gr.Number(label="Delay (seconds)"),
        gr.Dropdown(label="Device", choices=["cpu", "cuda"]),
        gr.Textbox(label="Input Text"),
        gr.Textbox(label="Output Audio Folder"),
        gr.File(label="Reference Audio File"),
        gr.Checkbox(label="Convert to MP3"),
        gr.File(label="Speaker1 Happy Reference Audio"),
        gr.File(label="Speaker1 Sad Reference Audio"),
        gr.File(label="Speaker1 Angry Reference Audio"),
        gr.File(label="Speaker1 Neutral Reference Audio")
    ],
    outputs=gr.Audio(label="Generated Audio"),
    title="F5-TTS Text-to-Speech Generator",
    description="Generate speech from text using the F5-TTS model."
)

iface.launch()
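For reference, the `[speaker:..., emotion:...]` tag format that `_determine_speaker_emotion` expects can be exercised on its own. This is a minimal standalone sketch of the same regex logic (the sample line is just an illustration):

```python
import re

# Standalone version of the tag parsing used in AgentF5TTS.
# Lines without a tag fall back to ("speaker1", "neutral").
def determine_speaker_emotion(text):
    speaker, emotion = "speaker1", "neutral"
    match = re.search(r"\[speaker:(.*?), emotion:(.*?)\]", text)
    if match:
        speaker = match.group(1).strip()
        emotion = match.group(2).strip()
    return speaker, emotion

line = "[speaker:speaker1, emotion:happy] Bom dia, tudo bem?"
print(determine_speaker_emotion(line))          # ('speaker1', 'happy')
# The tag is stripped before synthesis, same as in generate_emotion_speech:
print(re.sub(r'\[speaker:.*?\]\s*', '', line))  # 'Bom dia, tudo bem?'
```

Each input line is routed to the reference audio registered for that (speaker, emotion) pair, so untagged lines use the speaker1/neutral reference.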

Oh hey there! Thank you for your work; I know it can be hard not only to train but also to deal with people like me asking questions, lol. ... Am I missing something here? I have an installation of the original F5-TTS running with Gradio. I went to the custom tab, added your safetensors file, used the same vocab.txt (I think I read somewhere here that you should use the same one), and ran it. My reference is 12 seconds of Brazilian Portuguese, and all I got back was audio in another voice that is impossible to understand; it sounds like gibberish. Do I have to set any other parameters in the original Gradio app? Use a totally different one? Or add this code to the original one?

Sorry... just trying to figure this out

Yeah, I made it work... I just used your interface, but at the end of the day the output for me is still gibberish. Thank you again; at least you put me on the right path.

Hi, vocab.txt should just have the same number of lines. I added more audios of my own and from other sources to increase the vocabulary, and I'm training it again, with success. Do all the steps to retrain and keep the tokenizer size. But to get a good result I'm at 600 epochs.
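A quick way to confirm that an edited vocab.txt keeps the tokenizer size before retraining (a minimal sketch; the file paths are placeholders):

```python
# Compare line counts of the original and the edited vocab files.
# The paths passed in are placeholders for illustration.
def vocab_sizes_match(orig_path, new_path):
    with open(orig_path, encoding="utf-8") as f:
        orig_lines = sum(1 for _ in f)
    with open(new_path, encoding="utf-8") as f:
        new_lines = sum(1 for _ in f)
    return orig_lines == new_lines
```

If this returns False, the checkpoint's embedding table and the new tokenizer no longer line up, which is one way to end up with gibberish output.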

I will do it from scratch. I plan on making it multilingual, English and Portuguese, since many IT terms are in English. The first round I did on this one used clips of almost 5 s (Common Voice is mostly 5 s), so the inference is not good and too short. I plan on using longer audios, so I wrote a Python script to join multiple audios from the same speaker, starting low at 5 s until it reaches a maximum of 20 to 25 s.
When I have the new multilingual model, I will post it here.
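The joining idea above can be sketched as a greedy grouping over clip durations (a minimal sketch of the batching step only; actual concatenation of each group would still be done with ffmpeg, as in the script above):

```python
# Greedily batch a speaker's short clips (durations in seconds)
# so each combined clip stays at or under a target length.
def group_clips(durations, max_len=25.0):
    groups, current, total = [], [], 0.0
    for d in durations:
        # Start a new group when adding this clip would exceed the cap.
        if current and total + d > max_len:
            groups.append(current)
            current, total = [], 0.0
        current.append(d)
        total += d
    if current:
        groups.append(current)
    return groups

print(group_clips([5, 5, 5, 5, 5, 5]))  # [[5, 5, 5, 5, 5], [5]]
```

Clips longer than `max_len` still get their own group rather than being dropped, which keeps all the training audio.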
