---
license: mit
datasets:
- tobiolatunji/afrispeech-200
language:
- en
metrics:
- wer
library_name: transformers
pipeline_tag: automatic-speech-recognition
finetuned_from: openai/whisper-small
tasks: automatic-speech-recognition
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
---
# Whisper Small Model Card
<!-- Provide a quick summary of what the model is/does. -->
Whisper Small is a pre-trained model for automatic speech recognition (ASR) and speech translation.
It is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model.
It was trained on 680k hours of labelled speech data collected via large-scale weak supervision.
The model has 244 million parameters and is multilingual.
## Performance
Whisper Small achieves high accuracy and generalizes well to many datasets and domains without fine-tuning.
## Usage
To transcribe audio samples, the model should be used together with a `WhisperProcessor`.
The `WhisperProcessor` pre-processes the audio inputs (converting them to log-Mel spectrograms for the model)
and post-processes the model outputs (converting them from tokens back to text).
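A minimal transcription sketch with the `transformers` library; the LibriSpeech dummy clip below is only a stand-in for your own 16 kHz audio:

```python
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the processor (feature extractor + tokenizer) and the model
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Stand-in audio sample; substitute any 16 kHz mono waveform
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

# Pre-process: raw waveform -> log-Mel spectrogram input features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Generate token IDs, then post-process: tokens -> text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```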
## References
- https://huggingface.co/openai/whisper-small
- https://github.com/openai/whisper
- https://openai.com/research/whisper
- https://www.assemblyai.com/blog/how-to-run-openais-whisper-speech-recognition-model/
## Model Details
Whisper is a transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model.
It was trained on 680k hours of multilingual speech data labelled via large-scale weak supervision.
The models were trained on either English-only data or multilingual data.
The English-only models were trained on the task of speech recognition.
The multilingual models were trained on both speech recognition and speech translation.
For speech recognition, the model predicts transcriptions in the same language as the audio.
For speech translation, the model predicts transcriptions in a different language from the audio.
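The task is selected through decoder prompt IDs. A sketch, assuming French source audio (the dummy clip below is English and serves only as a placeholder):

```python
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# task="transcribe" keeps the output in the source language;
# task="translate" emits English text instead.
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```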
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
- Transcription
- Translation
## Training hyperparameters
- learning_rate: 1e-5
- train_batch_size: 8
- eval_batch_size: 8
- lr_scheduler_warmup_steps: 500
- max_steps: 4000
- metric_for_best_model: wer
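These values map directly onto `transformers`' `Seq2SeqTrainingArguments`; a minimal sketch follows. The `output_dir` and the evaluation/save cadence are assumptions, not values reported in this card:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-afrispeech",  # hypothetical output path
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    max_steps=4000,
    evaluation_strategy="steps",              # assumption: periodic evaluation
    save_strategy="steps",                    # assumption: checkpoint on the same cadence
    load_best_model_at_end=True,              # needed for metric_for_best_model to apply
    metric_for_best_model="wer",
    greater_is_better=False,                  # lower word error rate is better
    predict_with_generate=True,               # decode with generate() so WER can be computed
)
```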