Update README
- app.py +1 -1
- docs/options.md +23 -3
app.py
CHANGED
|
@@ -209,7 +209,7 @@ def create_ui(inputAudioMaxDuration, share=False, server_name: str = None):
|
|
| 209 |
ui_description += " audio and is also a multi-task model that can perform multilingual speech recognition "
|
| 210 |
ui_description += " as well as speech translation and language identification. "
|
| 211 |
|
| 212 |
-
ui_description += "\n\n\n\nFor longer audio files (>10 minutes), it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option."
|
| 213 |
|
| 214 |
if inputAudioMaxDuration > 0:
|
| 215 |
ui_description += "\n\n" + "Max audio file length: " + str(inputAudioMaxDuration) + " s"
|
|
|
|
| 209 |
ui_description += " audio and is also a multi-task model that can perform multilingual speech recognition "
|
| 210 |
ui_description += " as well as speech translation and language identification. "
|
| 211 |
|
| 212 |
+
ui_description += "\n\n\n\nFor longer audio files (>10 minutes) not in English, it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option."
|
| 213 |
|
| 214 |
if inputAudioMaxDuration > 0:
|
| 215 |
ui_description += "\n\n" + "Max audio file length: " + str(inputAudioMaxDuration) + " s"
|
docs/options.md
CHANGED
|
@@ -33,11 +33,23 @@ the URL.
|
|
| 33 |
Select the task - either "transcribe" to transcribe the audio to text, or "translate" to translate it to English.
|
| 34 |
|
| 35 |
## Vad
|
| 36 |
* none
|
| 37 |
* Run whisper on the entire audio input
|
| 38 |
* silero-vad
|
| 39 |
-
* Use Silero VAD to detect sections that contain speech, and run
|
| 40 |
-
on the gaps between each speech section
|
| 41 |
* silero-vad-skip-gaps
|
| 42 |
* As above, but sections that don't contain speech according to Silero will be skipped. This will be slightly faster, but
|
| 43 |
may cause dialogue to be skipped.
|
|
@@ -55,4 +67,12 @@ Disables merging of adjacent speech sections if they are this number of seconds
|
|
| 55 |
The number of seconds (floating point) to add to the beginning and end of each speech section. Setting this to a number
|
| 56 |
larger than zero ensures that Whisper is more likely to correctly transcribe a sentence at the beginning of
|
| 57 |
a speech section. However, this also increases the probability of Whisper assigning the wrong timestamp
|
| 58 |
-
to each transcribed line. The default value is 1 second.
|
| 33 |
Select the task - either "transcribe" to transcribe the audio to text, or "translate" to translate it to English.
|
| 34 |
|
| 35 |
## Vad
|
| 36 |
+
Using a VAD will improve the timing accuracy of each transcribed line, as well as prevent Whisper from getting into an infinite
|
| 37 |
+
loop detecting the same sentence over and over again. The downside is that this may be at a cost to text accuracy, especially
|
| 38 |
+
with regard to unique words or names that appear in the audio. You can compensate for this by increasing the prompt window.
|
| 39 |
+
|
| 40 |
+
Note that English is very well handled by Whisper, and it's less susceptible to issues surrounding bad timings and infinite loops.
|
| 41 |
+
So you may only need to use a VAD for other languages, such as Japanese, or when the audio is very long.
|
| 42 |
+
|
| 43 |
* none
|
| 44 |
* Run whisper on the entire audio input
|
| 45 |
* silero-vad
|
| 46 |
+
* Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section. Whisper is also run
|
| 47 |
+
on the gaps between each speech section, by either expanding the section up to the max merge size, or running Whisper independently
|
| 48 |
+
on the non-speech section.
|
| 49 |
+
* silero-vad-expand-into-gaps
|
| 50 |
+
* Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section. Each speech section will be expanded
|
| 51 |
+
such that it covers any adjacent non-speech sections. For instance, if an audio file of one minute contains the speech sections
|
| 52 |
+
00:00 - 00:10 (A) and 00:30 - 00:40 (B), the first section (A) will be expanded to 00:00 - 00:30, and (B) will be expanded to 00:30 - 01:00.
|
| 53 |
* silero-vad-skip-gaps
|
| 54 |
* As above, but sections that don't contain speech according to Silero will be skipped. This will be slightly faster, but
|
| 55 |
may cause dialogue to be skipped.
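
The expansion described for silero-vad-expand-into-gaps can be illustrated with a small sketch. This is not the project's implementation: `expand_into_gaps` is a hypothetical helper, sections are assumed to be sorted `(start, end)` tuples in seconds, and each section is assumed to grow only forward, up to the start of the next section or the end of the audio.

```python
def expand_into_gaps(sections, total_duration):
    """Extend each speech section over the non-speech gap that follows it.

    Hypothetical helper for illustration; `sections` is a sorted list of
    (start, end) tuples in seconds, `total_duration` is the audio length.
    """
    expanded = []
    for index, (start, _end) in enumerate(sections):
        if index + 1 < len(sections):
            # Grow up to the start of the next speech section
            new_end = sections[index + 1][0]
        else:
            # The last section grows up to the end of the audio
            new_end = total_duration
        expanded.append((start, new_end))
    return expanded


# The one-minute example from the text: A = 0-10 s, B = 30-40 s
print(expand_into_gaps([(0, 10), (30, 40)], 60))
```

With the sections from the example, (A) becomes 0 to 30 seconds and (B) becomes 30 to 60 seconds, matching the description above.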
|
|
|
|
| 67 |
The number of seconds (floating point) to add to the beginning and end of each speech section. Setting this to a number
|
| 68 |
larger than zero ensures that Whisper is more likely to correctly transcribe a sentence at the beginning of
|
| 69 |
a speech section. However, this also increases the probability of Whisper assigning the wrong timestamp
|
| 70 |
+
to each transcribed line. The default value is 1 second.
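
As a rough illustration of the padding described above, the following sketch widens each speech section by a fixed number of seconds. `pad_sections` is a hypothetical name, and clamping the result to the bounds of the audio is an assumption rather than documented behaviour.

```python
def pad_sections(sections, padding, total_duration):
    """Add `padding` seconds to both ends of each (start, end) section.

    Hypothetical helper for illustration; clamping to [0, total_duration]
    is an assumption, not something the documentation states.
    """
    return [
        (max(0.0, start - padding), min(total_duration, end + padding))
        for start, end in sections
    ]


# With a padding of 1 second on a 60 second file
print(pad_sections([(0.5, 10.0), (30.0, 59.5)], 1.0, 60.0))
```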
|
| 71 |
+
|
| 72 |
+
## VAD - Prompt Window (s)
|
| 73 |
+
The text of a detected line will be included as a prompt to the next speech section, if the speech section starts at most this
|
| 74 |
+
number of seconds after the line has finished. For instance, if a line ends at 10:00, and the next speech section starts at
|
| 75 |
+
10:04, the line's text will be included if the prompt window is 4 seconds or more (10:04 - 10:00 = 4 seconds).
|
| 76 |
+
|
| 77 |
+
Note that detected lines in gaps between speech sections will not be included in the prompt
|
| 78 |
+
(if silero-vad or silero-vad-expand-into-gaps is used).
|
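
The prompt window rule can be written as a simple comparison. The sketch below is only an illustration of the rule as worded above: `should_include_prompt` is a hypothetical name, and all times are assumed to be in seconds.

```python
def should_include_prompt(line_end, section_start, prompt_window):
    """Return True if a finished line's text would be used as a prompt.

    Hypothetical helper for illustration; the line is included when the
    next speech section starts at most `prompt_window` seconds after the
    line has finished.
    """
    return (section_start - line_end) <= prompt_window


# The example from the text: the line ends at 10:00, the next section starts at 10:04
line_end = 10 * 60           # 600 seconds
section_start = 10 * 60 + 4  # 604 seconds
print(should_include_prompt(line_end, section_start, prompt_window=4))
print(should_include_prompt(line_end, section_start, prompt_window=3))
```

A window of 4 seconds includes the line, while a window of 3 seconds does not.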