Spaces: fishaudio/fish-speech-1
Issues with hallucination and repetition

by YungGump - opened Sep 11, 2024

Thanks for sharing your work. I must say I'm impressed with the generation speed and how close the voice gets to the actual reference sample sometimes.

While testing your work for a bit (English only), I noticed a few unwanted behaviours:

- The model tends to generate unrealistically long pauses in the speech.
- Within a single output, some snippets match the reference voice very well, while others (typically after a long pause) don't match it at all.
- The model also has issues with repetition. For example, it would output something like "Hello my name is Forest Forest", saying the name twice.
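For anyone who wants to check the long-pause symptom without listening to every output, here is a rough sketch. It assumes the generated audio is already loaded as a sequence of float samples in the range -1.0 to 1.0; `find_long_pauses`, the amplitude threshold and the one-second minimum are names and values I picked for illustration, not anything from fish-speech.

```python
def find_long_pauses(samples, sample_rate, amplitude_threshold=0.01, min_pause_s=1.0):
    """Return (start_s, end_s) spans where the signal stays near-silent.

    Hypothetical helper: the threshold and minimum length are assumed values
    and would need tuning for real recordings.
    """
    pauses = []
    run_start = None  # index where the current quiet run began, if any
    for i, sample in enumerate(samples):
        if abs(sample) < amplitude_threshold:
            if run_start is None:
                run_start = i
        else:
            # A loud sample ends the quiet run; keep it only if it was long enough.
            if run_start is not None and (i - run_start) / sample_rate >= min_pause_s:
                pauses.append((run_start / sample_rate, i / sample_rate))
            run_start = None
    # Handle a quiet run that lasts until the end of the clip.
    if run_start is not None and (len(samples) - run_start) / sample_rate >= min_pause_s:
        pauses.append((run_start / sample_rate, len(samples) / sample_rate))
    return pauses
```

Per-sample thresholding is crude; a windowed energy measure would probably be more robust, but this is enough to flag the multi-second gaps described above.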

The setup of my tests:

- I used 1 minute of clear reference audio with no background noise.
- I tried with Auto-Labeling both on and off.
- I left all the advanced settings at their defaults.
- I only tested two very simple prompts: "What's up, my name is Forest" and "Hello, my name is Forest".

Are you aware of this behaviour? If so, do you have instructions on how to work around it or fix it?
@lengyue233 @Phillippe @supurreme @HuanLin @auitumnc @lj1995 @innnky @CjangCjengh

PoTaTo721

Fish Audio org Sep 11, 2024

We recommend uploading 20-40 seconds of reference audio.
Also, please try again if the model generates unsatisfying results.
Have fun!
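If the reference recording is longer than that, one way to cut it down is sketched below. It assumes an uncompressed PCM WAV file and uses only Python's standard `wave` module; `trim_wav` and the 30-second default are my own choices for illustration, not part of fish-speech.

```python
import wave


def trim_wav(src_path: str, dst_path: str, max_seconds: float = 30.0) -> float:
    """Copy at most max_seconds of audio from src_path to dst_path.

    Hypothetical helper; returns the duration (in seconds) that was kept.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        frames_to_keep = min(src.getnframes(), int(max_seconds * rate))
        frames = src.readframes(frames_to_keep)
    with wave.open(dst_path, "wb") as dst:
        # The frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
    return frames_to_keep / rate
```

This simply keeps the first 30 seconds. Picking the cleanest stretch of the recording by hand is likely to matter more than the exact length.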

lengyue233

Fish Audio org Sep 12, 2024

We are doing some internal research on these issues. The long-pause issue should be fixed in the near future. Repeated or missing words are harder to solve given that it's an auto-regressive model, but enabling rerank can reduce the probability.
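To illustrate the general idea behind reranking (not necessarily how it is implemented here): generate several candidates, transcribe each one with a speech recognizer, and keep the candidate whose transcript is closest to the requested text. The sketch below covers only the selection step and assumes the transcripts are already available as strings; `normalize`, `word_edit_distance` and `rerank` are hypothetical names.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from the transcript
                            curr[j - 1] + 1,      # extra word in the transcript
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def rerank(target_text: str, transcripts: list[str]) -> int:
    """Return the index of the transcript closest to target_text.

    Assumes transcripts come from some ASR model run on each candidate;
    that step is outside this sketch. Ties go to the earliest candidate.
    """
    target = normalize(target_text)
    distances = [word_edit_distance(target, normalize(t)) for t in transcripts]
    return min(range(len(transcripts)), key=distances.__getitem__)
```

With the example from this thread, a candidate transcribed as "Hello my name is Forest Forest." has a distance of 1 from the prompt and would lose to a candidate transcribed exactly. This also shows why rerank only lowers the probability of a bad output: if every candidate repeats a word, the best of them still does.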

YungGump

Sep 12, 2024

@PoTaTo721 I tried it with a 25-second audio clip and it's actually worse than what I experienced yesterday. For the simple phrase "Hello, my name is Forest", it now generates audio samples of up to almost 50 seconds. I attached a screenshot of the worst case I had.
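A cheap guard against outputs like this is to compare the generated duration with a rough estimate from the word count and regenerate when it is far off. The sketch below assumes a speaking rate of about 2.5 words per second and a tolerance factor of 3; both numbers, and the name `is_plausible_duration`, are my own assumptions.

```python
def is_plausible_duration(text: str, duration_s: float,
                          words_per_second: float = 2.5,
                          max_ratio: float = 3.0,
                          min_expected_s: float = 1.0) -> bool:
    """Return True if duration_s is within max_ratio of a word-count estimate.

    Hypothetical helper; the speaking rate and tolerance are assumed values.
    """
    word_count = len(text.split())
    # Floor the estimate so very short prompts are not judged too strictly.
    expected_s = max(word_count / words_per_second, min_expected_s)
    return duration_s <= expected_s * max_ratio
```

For "Hello, my name is Forest" the estimate is 2 seconds, so anything over 6 seconds is rejected, including the 50-second sample.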