Update README

Files changed:
- README.md +8 -2
- docs/options.md +20 -12

README.md
````diff
@@ -76,6 +76,12 @@ cores (up to 8):
 python app.py --input_audio_max_duration -1 --auto_parallel True
 ```
 
+### Multiple Files
+
+You can upload multiple files either through the "Upload files" option, or as a playlist on YouTube.
+Each audio file will then be processed in turn, and the resulting SRT/VTT/Transcript will be made available in the "Download" section.
+When more than one file is processed, the UI will also generate an "All_Output" zip file containing all the text output files.
+
 # Docker
 
 To run it in Docker, first install Docker and optionally the NVIDIA Container Toolkit in order to use the GPU.
@@ -109,7 +115,7 @@ You can also pass custom arguments to `app.py` in the Docker container, for instance:
 sudo docker run -d --gpus all -p 7860:7860 \
 --mount type=bind,source=/home/administrator/.cache/whisper,target=/root/.cache/whisper \
 --restart=on-failure:15 registry.gitlab.com/aadnk/whisper-webui:latest \
-app.py --input_audio_max_duration -1 --server_name 0.0.0.0 --…
+app.py --input_audio_max_duration -1 --server_name 0.0.0.0 --auto_parallel True \
 --default_vad silero-vad --default_model_name large
 ```
 
@@ -119,7 +125,7 @@ sudo docker run --gpus all \
 --mount type=bind,source=/home/administrator/.cache/whisper,target=/root/.cache/whisper \
 --mount type=bind,source=${PWD},target=/app/data \
 registry.gitlab.com/aadnk/whisper-webui:latest \
-cli.py --model large --…
+cli.py --model large --auto_parallel True --vad silero-vad \
 --output_dir /app/data /app/data/YOUR-FILE-HERE.mp4
 ```
 
````
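For reference, the flags added in the Docker examples above should also apply to a plain local run of `app.py`. This is only a sketch built from the options that appear in this diff (`--server_name`, `--auto_parallel`, `--default_vad`, `--default_model_name`), not a verified command:

```bash
# Sketch: local (non-Docker) equivalent of the updated Docker example above.
# All flags are taken from the diff; verify against `python app.py --help`.
python app.py --input_audio_max_duration -1 --server_name 0.0.0.0 \
  --auto_parallel True --default_vad silero-vad --default_model_name large
```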
    	
docs/options.md
````diff
@@ -3,18 +3,19 @@ To transcribe or translate an audio file, you can either copy an URL from a website
 supported by YT-DLP will work, including YouTube). Otherwise, upload an audio file (choose "All Files (*.*)" 
 in the file selector to select any file type, including video files) or use the microphone.
 
-For longer audio files (>10 minutes), it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option.
+For longer audio files (>10 minutes), it is recommended that you select Silero VAD (Voice Activity Detector) in the VAD option, especially if you are using the `large-v1` model. Note that `large-v2` is a lot more forgiving, but you may still want to use a VAD with a slightly higher "VAD - Max Merge Size (s)" (60 seconds or more).
 
 ## Model
 Select the model that Whisper will use to transcribe the audio:
 
-| Size …
-|------…
-| tiny …
-| base …
-| small …
-| medium …
-| large …
+| Size      | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
+|-----------|------------|--------------------|--------------------|---------------|----------------|
+| tiny      | 39 M       | tiny.en            | tiny               | ~1 GB         | ~32x           |
+| base      | 74 M       | base.en            | base               | ~1 GB         | ~16x           |
+| small     | 244 M      | small.en           | small              | ~2 GB         | ~6x            |
+| medium    | 769 M      | medium.en          | medium             | ~5 GB         | ~2x            |
+| large     | 1550 M     | N/A                | large              | ~10 GB        | 1x             |
+| large-v2  | 1550 M     | N/A                | large              | ~10 GB        | 1x             |
 
 ## Language
 
@@ -24,10 +25,12 @@ Note that if the selected language and the language in the audio differs, Whisper
 language. For instance, if the audio is in English but you select Japanese, the model may translate the audio to Japanese.
 
 ## Inputs
-The options "URL (YouTube, etc.)", "Upload …
+The options "URL (YouTube, etc.)", "Upload Files" or "Microphone Input" allow you to send an audio input to the model.
 
-…
-the URL.
+### Multiple Files
+Note that the UI will only process either the given URL or the uploaded files (including microphone) - not both.
+
+But you can upload multiple files either through the "Upload files" option, or as a playlist on YouTube. Each audio file will then be processed in turn, and the resulting SRT/VTT/Transcript will be made available in the "Download" section. When more than one file is processed, the UI will also generate an "All_Output" zip file containing all the text output files.
 
 ## Task
 Select the task - either "transcribe" to transcribe the audio to text, or "translate" to translate it to English.
@@ -75,4 +78,9 @@ number of seconds after the line has finished. For instance, if a line ends at 10:00
 10:04, the line's text will be included if the prompt window is 4 seconds or more (10:04 - 10:00 = 4 seconds).
 
 Note that detected lines in gaps between speech sections will not be included in the prompt 
-(if silero-vad or silero-vad-expand-into-gaps) is used.
+(if silero-vad or silero-vad-expand-into-gaps is used).
+
+# Command Line Options
+
+Both `app.py` and `cli.py` also accept command line options, such as the ability to enable parallel execution on multiple
+CPU/GPU cores, the default model name/VAD and so on. Consult the README in the root folder for more information.
````
