Compared to later updates, commit "1acaa19" provides exceptional Japanese subtitles and transcripts.
On commit "1acaa19", I found that the transcription was significantly better than in the later updates. There is almost no untranscribed text between pauses, and it generally captures more text than the later updates. The issue with the "ご視聴ありがとうございました!" ("Thank you for watching!") thing has also been mostly resolved in this commit. When I tried the later updates, however, the transcription quality was significantly worse. I am unsure why this is the case, but I wanted to bring it to your attention. To reproduce my results: I am using large-v1, as I have found that it gives better results than large-v2. Everything else is default, except for VAD - Max Merge Size (s), which is set to 40. Please let me know if you find out what the problem is.
Edit: I also forgot to mention that many of the timings have been fixed lol.
So I tried running both versions (1acaa19 and b59dd62) on this YouTube video, but I'd say the results were initially a bit hard to interpret:
- Gist that contains combined.srt, log-1acaa19.txt and log-b59dd62.txt
- Note that I used this script to combine the SRT files.
1
00:00:00,000 --> 00:00:04,000
A: じゃあね、ちょっと自己紹介してもらおうと思いますので
2
00:00:00,000 --> 00:00:04,000
B: じゃあね、ちょっと自己紹介してもらおうと思いますので
3
00:00:04,000 --> 00:00:06,640
A: 2022年のエピソードと共に
4
00:00:04,000 --> 00:00:06,640
B: 2022年のエピソードと共に
5
00:00:06,640 --> 00:00:08,840
A: お願いいたします、ということで
6
00:00:06,640 --> 00:00:08,840
B: お願いいたします、ということで
They're identical until the segment 03:10.394 to 03:34.106, when they start to diverge:
167
00:03:10,394 --> 00:03:11,394
A: そっかなぁちょっと
168
00:03:10,394 --> 00:03:11,854
B: わからないww
169
00:03:11,394 --> 00:03:14,394
A: みんななんかぜひやらかしいエピソードありますか?
170
00:03:11,934 --> 00:03:14,094
B: ねーねー τι壊しいエピソードありますか?
# ...
179
00:03:22,394 --> 00:03:24,394
A: 自分のねレインボーとともに寝てたこと
180
00:03:22,914 --> 00:03:25,054
B: 自分のねレインボーと chimney
But they then go back to being (mostly) the same.
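For anyone who wants to build a similar side-by-side file, here is a minimal sketch of how two SRT files could be interleaved with "A:" and "B:" prefixes. This is not the script linked above; the function names (`parse_srt`, `combine`) and the tie-breaking rule (A before B when start times match) are assumptions based only on how the combined listing looks.

```python
import re
from dataclasses import dataclass

TIME_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


@dataclass
class Cue:
    start_ms: int
    end_ms: int
    text: str


def to_ms(stamp: str) -> int:
    """Convert 'HH:MM:SS,mmm' to milliseconds."""
    h, m, s, ms = map(int, TIME_RE.fullmatch(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def to_stamp(ms: int) -> str:
    """Convert milliseconds back to 'HH:MM:SS,mmm'."""
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:
                start, end = line.split("-->")
                cues.append(Cue(to_ms(start), to_ms(end), "\n".join(lines[i + 1:])))
                break
    return cues


def combine(a: list[Cue], b: list[Cue]) -> str:
    """Interleave two cue lists by start time, labelling each cue A or B."""
    tagged = [(cue, "A") for cue in a] + [(cue, "B") for cue in b]
    # Assumption: on equal start times, A is listed before B.
    tagged.sort(key=lambda pair: (pair[0].start_ms, pair[1]))
    blocks = []
    for index, (cue, label) in enumerate(tagged, start=1):
        timing = f"{to_stamp(cue.start_ms)} --> {to_stamp(cue.end_ms)}"
        blocks.append(f"{index}\n{timing}\n{label}: {cue.text}\n")
    return "\n".join(blocks)
```

To use it, read both subtitle files as UTF-8 text, pass each through `parse_srt`, and write the string returned by `combine` to a new file. Cues are renumbered from 1, so the indices in the combined file no longer match either input.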
B (b59dd62) is worse, with random incorrect English words, but when I tried to extract the segment and run it directly, the difference was much harder to quantify:
However, when looking at the diff between 1acaa19 and b59dd62, I notice that the default value for "patience" was changed from NULL (0) to 1, as this is the original default in Whisper. And when I re-ran b59dd62 with "patience" set to 0 (which you can do in the "Full" interface), I got results relatively similar to 1acaa19:
So, you may want to try setting the "Patience - Zero temperature" value to 0 directly yourself, and see if it improves the output in your case. But the default in Whisper is 1, so I'll have to investigate this some more to see if I ought to change the default from 1 to 0.
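When comparing runs with different settings, it can help to put a rough number on how far two transcripts differ. Below is a minimal sketch using Python's standard `difflib`; the function names are made up for illustration, and the score only says how different two texts are, not which one is more accurate (that would need a reference transcript).

```python
from difflib import SequenceMatcher


def transcript_similarity(text_a: str, text_b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical text.

    Character-level matching is used because Japanese text has no
    whitespace word boundaries to split on.
    """
    return SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()


def differing_spans(text_a: str, text_b: str) -> list[tuple[str, str]]:
    """Return (part_of_a, part_of_b) pairs for every non-matching region."""
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    return [
        (text_a[i1:i2], text_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Example with one of the diverging cues quoted earlier in this thread.
run_a = "自分のねレインボーとともに寝てたこと"
run_b = "自分のねレインボーと chimney"
print(f"similarity: {transcript_similarity(run_a, run_b):.2f}")
for part_a, part_b in differing_spans(run_a, run_b):
    print(f"A: {part_a!r}  vs  B: {part_b!r}")
```

In practice you would join the text of all cues from each run (for example one run with patience 0 and one with patience 1) and compare the two strings; long stretches listed by `differing_spans` are the places worth listening to by hand.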
Okay, so I did a fresh Git pull from your project and retested it. It did indeed produce very similar results, which leads me to believe that the issue was most likely on my part, possibly due to something I edited. Now the quality is very similar, with minor changes/mistakes but nothing game-breaking like before. Previously, there was a high rate of untranscribed text, but that's not the case anymore. Anyway, I want to thank you for making this amazing tool. You did a killer job on it! Also, changing the "Patience - Zero temperature" to 0 did indeed give a small but noticeable quality boost :)