Issues with hallucination and repetition

#3
by YungGump - opened

Thanks for sharing your work. I must say I'm impressed with the generation speed and how close the voice gets to the actual reference sample sometimes.

While testing your work for a bit (English only), I noticed a few unwanted behaviours:

  1. The model tends to generate unrealistically long pauses in the speech.
  2. Within a single output, some snippets match the reference voice very well, while other snippets (after a long pause) don't match it at all.
  3. The model also has issues with repetition. For example, it would output something like "Hello my name is Forest Forest", repeating the name.

The setup of my tests:

  • I used 1 minute of clear reference audio, no background noise
  • I tried with both Auto-Labeling on and off
  • I left all the advanced settings at their defaults
  • I only tested two very simple prompts: "What's up, my name is Forest" and "Hello, my name is Forest"

Are you aware of this behaviour? If so, do you have any advice on how to work around it or fix it?
@lengyue233 @Phillippe @supurreme @HuanLin @auitumnc @lj1995 @innnky @CjangCjengh

Fish Audio org

We recommend uploading 20-40 s of reference audio.
Also, please try again if the model generates unsatisfactory results.
Have fun!

Fish Audio org

We are doing some internal research on these issues. The long-pause issue should be fixed in the near future. Repeated or missing words are harder to solve, given that it's an auto-regressive model, but enabling rerank can reduce the probability.
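The thread does not describe how rerank works internally. As a rough illustration only, one common way to rerank TTS output is to generate several candidates, transcribe each one, and keep the candidate whose transcript is closest to the requested text. The sketch below assumes that approach; `synthesize` and `transcribe` are hypothetical callables you would supply yourself (they are not Fish Speech APIs), and `num_candidates` is an illustrative parameter.

```python
import re
from typing import Callable, List, Tuple, TypeVar

Audio = TypeVar("Audio")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def pick_best_candidate(
    text: str,
    synthesize: Callable[[str], Audio],   # hypothetical: text -> audio
    transcribe: Callable[[Audio], str],   # hypothetical: audio -> text (any ASR)
    num_candidates: int = 4,
) -> Tuple[Audio, float]:
    """Generate several candidates and return the one with the lowest WER."""
    best_audio, best_wer = None, float("inf")
    for _ in range(num_candidates):
        audio = synthesize(text)
        wer = word_error_rate(text, transcribe(audio))
        if wer < best_wer:
            best_audio, best_wer = audio, wer
        if best_wer == 0.0:
            break  # transcript matches exactly, no need to keep sampling
    return best_audio, best_wer
```

A repeated word such as "Forest Forest" counts as one extra word, so that candidate scores worse than a clean one. This only lowers the chance of a bad output; it cannot guarantee a good one.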

@PoTaTo721 I tried it with a 25-second audio clip and it's actually worse than what I experienced yesterday. For the simple phrase "Hello, my name is Forest" it now generates audio samples that are up to almost 50 seconds long. I attached a screenshot of the worst case I had.
Schermafbeelding 2024-09-12 om 10.33.27.png
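A cheap way to catch cases like the 50-second output above is to compare the clip length against a rough estimate derived from the word count, and regenerate when it is far off. This is only a sketch: the speaking rate of 2.5 words per second, the slack factor, and the floor are assumptions picked for illustration, not values from Fish Speech.

```python
import re


def expected_duration_s(text: str, words_per_second: float = 2.5) -> float:
    """Rough speaking-time estimate; 2.5 words/s is an assumed average rate."""
    words = re.findall(r"[\w']+", text)
    return len(words) / words_per_second


def is_plausible_duration(
    text: str,
    duration_s: float,
    words_per_second: float = 2.5,
    slack: float = 3.0,    # assumed tolerance: allow up to 3x the estimate
    floor_s: float = 2.0,  # never reject clips shorter than this
) -> bool:
    """Return False when the clip is much longer than the text warrants."""
    limit = max(floor_s, slack * expected_duration_s(text, words_per_second))
    return duration_s <= limit
```

For "Hello, my name is Forest" (5 words) the estimate is 2 seconds and the limit is 6 seconds, so a clip of almost 50 seconds would be flagged for a retry.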

@lengyue233 Thanks a lot for your input. Do you think there could be non-AI post-processing methods that could help filter this out?
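As one example of such a non-AI post-processing step, long pauses can be capped with a simple energy threshold. The sketch below is not something the maintainers proposed. It assumes mono audio as floats in [-1, 1], and the thresholds (`max_pause_s`, `frame_ms`, `silence_db`) are illustrative defaults that would need tuning. It only shortens silence; it does nothing about voice mismatch or repeated words.

```python
import math
from typing import List, Sequence


def shorten_long_pauses(
    samples: Sequence[float],
    sample_rate: int,
    max_pause_s: float = 0.5,   # assumed: longest pause to keep
    frame_ms: int = 20,         # analysis frame length
    silence_db: float = -40.0,  # assumed: frames below this RMS level are silent
) -> List[float]:
    """Cap every silent stretch at roughly max_pause_s seconds."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    max_silent_frames = max(1, round(max_pause_s * 1000 / frame_ms))
    threshold = 10.0 ** (silence_db / 20.0)  # dBFS -> linear amplitude

    kept: List[float] = []
    silent_run = 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < threshold:
            silent_run += 1
            if silent_run > max_silent_frames:
                continue  # drop silence beyond the allowed pause length
        else:
            silent_run = 0
        kept.extend(frame)
    return kept
```

Hard cuts like this can leave small clicks at the joins, so a short crossfade may be worth adding. For real workloads a NumPy version of the same loop would be much faster.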

Fish Audio org

I suspect there is an issue in my torch compile code. Debugging it now.

@lengyue233 any updates?

Fish Audio org

We pushed a fix to the repo that resolves a kernel error. Maybe give it a try?

Example of the second issue.

Used API request parameters:
https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena/blob/main/app.py#L291-L304

@lengyue233 I managed to reproduce the second issue twice on this FS space, regardless of whether "Maximum tokens per batch" is 0 or 1024. It seems to have roughly a 5% chance of occurring.

I can't believe it, we won the championship! This is the best day ever!

Fish Audio org

Did you specify any prompt audio?

Yes. English.wav

A new one, this time with Fish Speech 1.5:
"I'm absolutely devastated, my world has fallen apart."

Used params: https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena/blob/2dfe855f31bfe121cd2c671e6c30e32883332d4e/app.py#L411-L424
