Issues with hallucination and repetition
Thanks for sharing your work. I must say I'm impressed with the generation speed and how close the voice gets to the actual reference sample sometimes.
While testing your work for a bit (English only), I noticed a few unwanted behaviours:
- The model tends to generate unrealistically long pauses in the speech
- The model can have snippets that match the reference voice very well, and other snippets - in the same output, after a long pause - that totally don't match
- The model also has issues with repetition. For example, it would output something like "Hello my name is Forest Forest.", saying the name twice.
The setup of my tests:
- I used 1 minute of clear reference audio, no background noise
- I tried with Auto-Labeling both on and off
- I left all the advanced configs at their defaults
- I only tested with a very simple example: "What's up, my name is Forest" and "Hello, my name is Forest"
Are you aware of this behaviour? If so, do you have instructions on how to cope with it or fix it?
@lengyue233
@Phillippe
@supurreme
@HuanLin
@auitumnc
@lj1995
@innnky
@CjangCjengh
We recommend uploading 20-40 s of reference audio.
Also, please try again if the model generates unsatisfying results.
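If you are calling the model from a script, the "try again" step can be automated with a rough plausibility check on the output length. The sketch below is only an illustration: `synthesize` is a hypothetical callable standing in for whatever client you use (it is not part of Fish Speech), and the words-per-second bounds are assumptions loosely based on conversational English, not tuned values.

```python
def is_plausible_duration(text, duration_s, min_wps=1.0, max_wps=6.0):
    """Return True if the speaking rate (words per second) looks realistic.

    The bounds are assumed defaults for English; adjust them for your data.
    """
    words = len(text.split())
    if words == 0 or duration_s <= 0:
        return False
    return min_wps <= words / duration_s <= max_wps


def synthesize_with_retry(synthesize, text, max_attempts=3):
    """Call `synthesize` until the output duration looks plausible.

    `synthesize` is a hypothetical callable: text -> (samples, sample_rate).
    If no attempt passes the check, the last attempt is returned.
    """
    result = None
    for _ in range(max_attempts):
        samples, sample_rate = synthesize(text)
        result = (samples, sample_rate)
        if is_plausible_duration(text, len(samples) / sample_rate):
            break
    return result
```

A check like this would only flag gross failures such as a five-word sentence lasting tens of seconds. It cannot detect a repeated or missing word.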
Have fun!
We are doing some internal research on these issues. The long-pause issue should be fixed in the near future, but repeated or missing words are hard to solve given that it's an auto-regressive model. Enabling rerank can reduce the probability, though.
@PoTaTo721
I tried it with a 25 second audio clip and it's actually worse than what I experienced yesterday. For the simple phrase "Hello, my name is Forest" it now generates audio samples that go up to almost 50 seconds. I attached a screenshot of the worst case I had.
@lengyue233 Thanks a lot for your input. Do you think there could be non-AI post-processing methods that could help with filtering this out?
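For the long pauses specifically, one non-AI option is a simple energy gate that shortens any silent stretch above a maximum length. Below is a minimal sketch of that idea in plain Python, assuming mono audio as a list of floats in [-1, 1]. The function name, the -40 dB threshold, the 20 ms frame size, and the 0.4 s pause cap are all assumptions for illustration. It does nothing about repeated words or voice mismatch.

```python
import math


def collapse_long_pauses(samples, sample_rate, max_pause_s=0.4,
                         frame_ms=20, threshold_db=-40.0):
    """Shorten every silent run longer than max_pause_s to max_pause_s.

    A frame counts as silent when its RMS level is below threshold_db
    (relative to full scale). All defaults are assumed, untuned values.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    threshold = 10.0 ** (threshold_db / 20.0)
    max_silent_frames = max(1, round(max_pause_s * 1000 / frame_ms))

    out = []
    silent_run = 0  # number of consecutive silent frames seen so far
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        silent_run = silent_run + 1 if rms < threshold else 0
        # Keep speech frames, and silent frames only up to the cap.
        if silent_run <= max_silent_frames:
            out.extend(frame)
    return out
```

Hard cuts like this can click at the splice points, so a short crossfade may be worth adding in practice.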
I suspect there is an issue in my torch compile code. Debugging.
@lengyue233 any updates?
We pushed a fix to the repo that resolves a kernel error. Maybe give it a try?
Example of the second issue.
Used API request parameters:
https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena/blob/main/app.py#L291-L304
@lengyue233
Managed to get the second issue twice on this FS space, regardless of whether "Maximum tokens per batch" is 0 or 1024. It has roughly a 5% chance of occurring.
I can't believe it, we won the championship! This is the best day ever!
Did you specify any prompt audio?
Yes. English.wav
New one, Fish Speech 1.5: "I'm absolutely devastated, my world has fallen apart."
Used params: https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena/blob/2dfe855f31bfe121cd2c671e6c30e32883332d4e/app.py#L411-L424