---
license: mit
datasets:
- tobiolatunji/afrispeech-200
language:
- en
metrics:
- wer
library_name: transformers
pipeline_tag: automatic-speech-recognition
finetuned_from: openai/whisper-small
tasks: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
---
|
# Whisper Small Model Card |
|
|
|
|
|
|
Whisper Small is a pre-trained model for automatic speech recognition (ASR) and speech translation. It is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model, and was trained on 680,000 hours of labelled speech data annotated using large-scale weak supervision. The model has 244 million parameters and is multilingual.
|
|
|
## Performance

Whisper Small achieves high accuracy and generalizes well to many datasets and domains without fine-tuning.
|
|
|
## Usage

To transcribe audio samples, the model must be used together with a `WhisperProcessor`. The `WhisperProcessor` pre-processes the audio inputs (converting them to log-Mel spectrograms for the model) and post-processes the model outputs (converting them from tokens back to text), as shown in the sketch below.
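The following is a minimal transcription sketch against the `transformers` API. The dummy LibriSpeech dataset is used only for illustration; any 16 kHz mono waveform works.

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# Load the processor (feature extractor + tokenizer) and the model
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Load a sample audio clip (illustrative; any 16 kHz mono waveform works)
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

# Pre-process: raw waveform -> log-Mel spectrogram
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Generate token ids, then post-process them back to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```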
|
|
|
## References

- https://huggingface.co/openai/whisper-small
- https://github.com/openai/whisper
- https://openai.com/research/whisper
- https://www.assemblyai.com/blog/how-to-run-openais-whisper-speech-recognition-model/
|
|
|
|
|
## Model Details |
|
Whisper is a transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. |
|
It was trained on 680,000 hours of labelled speech data collected using large-scale weak supervision.
|
|
|
The models were trained on either English-only data or multilingual data. |
|
The English-only models were trained on the task of speech recognition. |
|
The multilingual models were trained on both speech recognition and speech translation. |
|
For speech recognition, the model predicts transcriptions in the same language as the audio. |
|
For speech translation, the model predicts a transcription in a different language from the audio (for Whisper, translation is into English).
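As a sketch under the `transformers` API, the task is selected at generation time via forced decoder ids; the source language shown here is an illustrative assumption.

```python
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Speech recognition: output stays in the source language (French, as an example)
transcribe_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")

# Speech translation: output is English regardless of the source language
translate_ids = processor.get_decoder_prompt_ids(language="french", task="translate")

# Either set of ids is then passed to generation, e.g.
# predicted_ids = model.generate(input_features, forced_decoder_ids=transcribe_ids)
```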
|
|
|
|
|
## Uses |
|
|
|
|
- Transcription |
|
- Translation |
|
|
|
|
|
## Training Hyperparameters
|
|
- learning_rate: 1e-5 |
|
- train_batch_size: 8 |
|
- eval_batch_size: 8 |
|
- lr_scheduler_warmup_steps: 500 |
|
- max_steps: 4000 |
|
- metric_for_best_model: wer |
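For reference, a minimal sketch of how these values map onto `transformers`' `Seq2SeqTrainingArguments`; `output_dir`, `evaluation_strategy`, and `load_best_model_at_end` are illustrative assumptions, not values recorded above.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-afrispeech",  # hypothetical output path
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    max_steps=4000,
    metric_for_best_model="wer",
    greater_is_better=False,       # lower WER is better
    evaluation_strategy="steps",   # assumed; required for metric_for_best_model
    load_best_model_at_end=True,   # assumed
)
```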