Higgs Audio v2 Voice Cloning Issues - Character Limit (?) and AudioContent Problems

#14

by dhairyashil - opened 4 days ago

4 days ago

I just wanted audio of a long text in a consistent voice. First of all, I could not get audio output longer than 40.6 seconds, even for longer text. Then I tried breaking the text into smaller pieces, but that gave different voices across the clips.

The outputs are interesting, but I came across a few issues that were not easy to resolve.
It is quite possible that I am messing up something basic.

Main issues observed:

AudioContent + text combinations fail above ~25 characters.
The "shape '[1, 1, 1, 1]' is invalid for input of size 0" error
Documentation parameter mismatches (audio_url vs waveform vs audio)

What works well:

Text-only generation with enhanced prompting
High audio quality when it works
Reasonable GPU performance

Looking for community input on the root causes and potential solutions. Has anyone gotten reliable voice cloning working with longer texts?
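
For reference, the "break the text" workaround described above can be sketched as follows. This is only a minimal sketch: it splits at sentence boundaries and hands every chunk to the same generation callable, so that the same reference audio and settings are reused for each clip. `generate_fn` is a hypothetical placeholder (not part of the Higgs Audio API), and `max_chars=200` is an arbitrary assumption. Whether this actually avoids the voice drift is not confirmed.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Split text at sentence boundaries into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def synthesize_long_text(text, generate_fn, max_chars=200):
    # generate_fn is a hypothetical callable: chunk text -> audio for that chunk.
    # It should reuse the same reference clip and settings on every call.
    return [generate_fn(chunk) for chunk in split_into_chunks(text, max_chars)]
```

The resulting clips would still need to be concatenated, and the chunk size tuned to stay under whatever limit the model actually has.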

xingjian-bosonai

BosonAI org 4 days ago

I think this should not happen. How did you prompt the model for long-text generation?

dhairyashil

3 days ago

•

edited 3 days ago

Please have a look at this - https://github.com/dhairyasil/higgs-audio-testing
(Disclaimer: vibe coding has been used at various stages)
Thanks.

Arturas-Hopro

3 days ago

Can I train a Lithuanian language using this system?

jattoedaltni

1 day ago

Please have a look at this - https://github.com/dhairyasil/higgs-audio-testing
(Disclaimer: vibe coding has been used at various stages)
Thanks.

For the audio inconsistencies, you can probably do something like this using FFmpeg. At least, that covers loudness; I'm not sure what other metrics would be used to measure inconsistency. Perhaps overall quality, as in compression effects, or the effects of learning from reference audio with varying levels of compression/quality?

Anyway I used this when piecing together a clean UI for chatterbox:
def normalizer(self, output_path):
    import subprocess

    # Write the result next to the input, e.g. clip.wav -> clip_normalized.wav
    output_path_processed = str(output_path).replace(".wav", "_normalized.wav")

    command = [
        "ffmpeg",
        "-y",  # overwrite an existing output file instead of prompting
        "-i", str(output_path),
        "-af", "loudnorm",  # EBU R128 loudness normalization filter
        output_path_processed
    ]

    try:
        subprocess.run(command, check=True)
        return output_path_processed
    except subprocess.CalledProcessError as e:
        print(f"FFMPEG Loudnorm Error: {e}")
        return output_path  # fallback to original if failed
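
To check whether loudness really differs between clips, a rough per-clip level can be measured before and after normalizing. The sketch below uses only the Python standard library and assumes 16-bit PCM WAV files. It reports plain RMS in dBFS, which is only a crude proxy and not the same measure that loudnorm targets. The function name `rms_dbfs` is made up for this example.

```python
import array
import math
import sys
import wave


def rms_dbfs(wav_path):
    """Return the RMS level of a 16-bit PCM WAV file in dB relative to full scale."""
    with wave.open(str(wav_path), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wf.readframes(wf.getnframes())

    samples = array.array("h", frames)  # signed 16-bit samples
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square / (32768 ** 2))
```

Calling this on every generated clip and comparing the values gives a quick idea of how far apart the levels are; a spread of several dB would suggest normalization is worth doing.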

dhairyashil

about 9 hours ago

I think this should not happen. How did you prompt the model for long-text generation?

Did you get a chance to look at the examples provided in the repo? Thanks.

https://github.com/dhairyasil/higgs-audio-testing
