Higgs Audio v2 Voice Cloning Issues - Character Limit (?) and AudioContent Problems

#14
by dhairyashil - opened

I just wanted audio of a long text in a consistent voice. First, I could not get audio output longer than 40.6 seconds, even for longer text. Then I tried splitting the text into chunks, but that gave different voices across the clips.
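
For anyone trying the same split-and-stitch approach, one way to break the text is at sentence boundaries under a character budget, so that no clip is cut mid-sentence. This is only a minimal sketch: the function name `split_into_chunks` and the default budget of 200 characters are illustrative assumptions, not documented Higgs Audio limits. It does not fix the voice drift by itself; each chunk would presumably still need the same reference audio and generation settings.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    max_chars=200 is an arbitrary illustrative budget, not a model limit.
    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to whatever generation call is in use, one clip per chunk.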

The outputs are interesting, but I came across a few issues that were not easy to resolve.
It is quite possible that I am messing up something basic.

Main issues observed:

  • AudioContent + text combinations fail above ~25 characters.
  • The "shape '[1, 1, 1, 1]' is invalid for input of size 0" error
  • Documentation parameter mismatches (audio_url vs waveform vs audio)

What works well:

  • Text-only generation with enhanced prompting
  • High audio quality when it works
  • Reasonable GPU performance

Looking for community input on the root causes and potential solutions. Has anyone gotten reliable voice cloning working with longer texts?

I think this should not happen. How did you prompt the model for long-text generation?

Please have a look at this - https://github.com/dhairyasil/higgs-audio-testing
(Disclaimer: vibe coding has been used at various stages)
Thanks.

Can I train a Lithuanian language using this system?


You can probably do something like this with FFmpeg to deal with the audio inconsistencies, at least as far as loudness is concerned. I'm not sure what other metrics would be used to measure inconsistency. Perhaps overall quality, as in compression effects, or the effects of learning from reference audio with varying levels of compression/quality?

Anyway I used this when piecing together a clean UI for chatterbox:
def normalizer(self, output_path):
    import subprocess

    # Write the result next to the input, e.g. clip.wav -> clip_normalized.wav
    output_path_processed = str(output_path).replace(".wav", "_normalized.wav")

    # Single-pass loudness normalization with FFmpeg's loudnorm filter
    command = [
        "ffmpeg",
        "-i", str(output_path),
        "-af", "loudnorm",
        output_path_processed
    ]

    try:
        subprocess.run(command, check=True)
        return output_path_processed
    except subprocess.CalledProcessError as e:
        print(f"FFMPEG Loudnorm Error: {e}")
        return output_path  # fallback to original if failed
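
If each clip is normalized separately, every clip gets its own gain adjustment, so it may work better to join the clips first and run the normalizer once on the combined file. Below is a rough sketch of the joining step using only Python's standard `wave` module. The name `concat_wavs` is made up for illustration, and it assumes all clips are PCM WAV files with the same channel count, sample width, and sample rate (the `wave` module does not read float-encoded WAVs).

```python
import wave


def concat_wavs(clip_paths, output_path):
    """Join PCM WAV clips end to end; all clips must share one audio format."""
    if not clip_paths:
        raise ValueError("clip_paths is empty")

    fmt = None      # (channels, sample width in bytes, frame rate) of the first clip
    frames = []
    for path in clip_paths:
        with wave.open(str(path), "rb") as clip:
            clip_fmt = (clip.getnchannels(), clip.getsampwidth(), clip.getframerate())
            if fmt is None:
                fmt = clip_fmt
            elif clip_fmt != fmt:
                # Mixing formats would produce garbled audio, so refuse instead.
                raise ValueError(f"{path} has format {clip_fmt}, expected {fmt}")
            frames.append(clip.readframes(clip.getnframes()))

    with wave.open(str(output_path), "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        out.writeframes(b"".join(frames))
    return output_path
```

The combined file can then be passed to the `normalizer` above in place of a single clip. Note that this only evens out loudness; it does not address differences in voice timbre between clips.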

I think this should not happen. How did you prompt the model for long-text generation?

Did you get a chance to look at the examples provided in the repo? Thanks.

https://github.com/dhairyasil/higgs-audio-testing
