Whisper-Small (el) for Transcription
This model is a fine-tuned version of openai/whisper-small on the mozilla-foundation/common_voice_11_0 el dataset. It achieves the following results on the evaluation set:
- Loss: 0.4805
- Wer: 20.6352
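For readers unfamiliar with the metric, the sketch below shows how a word error rate in percent can be computed for a single reference/hypothesis pair. It is a plain-Python illustration, not the code that produced the figure above. The function name `word_error_rate` is hypothetical, and the reported value is a corpus-level score computed by the training script's own metric after text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length * 100."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

On this scale, a score of 20.6352 corresponds to roughly one word-level error for every five reference words.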
Model description
This model is trained for transcription on the Greek subset of mozilla-foundation/common_voice_11_0, using the train and validation splits interleaved into a single training stream.
Intended uses & limitations
This model was produced as part of the Whisper Fine-Tuning Event (December 2022).
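A minimal inference sketch follows, assuming a recent release of the `transformers` library, an `ffmpeg` installation for decoding audio files, and that the checkpoint is published under the repository name shown on this page. The helper `transcribe` is a hypothetical name introduced for illustration, and passing `language` and `task` through `generate_kwargs` may not work on the older library version listed under Framework versions.

```python
MODEL_ID = "farsipal/whisper-sm-el-xs"  # repository name as it appears on this page


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe a Greek audio file and return the recognized text."""
    # Imported lazily so that defining this helper does not load the library.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # Whisper operates on 30-second windows
    )
    # Assumption: the installed version accepts language/task via generate_kwargs.
    result = asr(
        audio_path,
        generate_kwargs={"language": "greek", "task": "transcribe"},
    )
    return result["text"]
```

Calling `transcribe("sample.wav")`, where `sample.wav` is a placeholder for a local recording, downloads the checkpoint on first use and returns the transcription as a string.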
Training and evaluation data
Training used the train and validation splits, interleaved into one stream. Evaluation was done on the test split. Data was streamed from the Hugging Face Hub rather than downloaded in full.
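The data setup described above can be sketched with the `datasets` library roughly as follows. This is an illustration under assumptions, not the training script itself: the helper names `split_names` and `load_interleaved_streaming` are hypothetical, access to Common Voice on the Hub typically requires accepting the dataset terms and being authenticated, and newer `datasets` releases may load this dataset differently.

```python
def split_names(spec: str) -> list[str]:
    """Split a spec such as "train+validation" into individual split names."""
    return [name.strip() for name in spec.split("+") if name.strip()]


def load_interleaved_streaming(dataset_name: str, config_name: str, spec: str):
    """Stream each requested split and interleave them into one dataset."""
    # Imported lazily so that defining this helper does not load the library.
    from datasets import interleave_datasets, load_dataset

    parts = [
        load_dataset(dataset_name, config_name, split=name, streaming=True)
        for name in split_names(spec)
    ]
    # A single split needs no interleaving.
    return parts[0] if len(parts) == 1 else interleave_datasets(parts)
```

With the values used for this model, the training stream would come from `load_interleaved_streaming("mozilla-foundation/common_voice_11_0", "el", "train+validation")` and the evaluation stream from the same call with `"test"`.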
Training procedure
The training script has been uploaded to the files of this repository. The command used to run it was:
python ./run_speech_recognition_seq2seq_streaming.py \
--model_name_or_path "openai/whisper-small" \
--model_revision "main" \
--do_train True \
--do_eval True \
--use_auth_token False \
--freeze_encoder False \
--model_index_name "whisper-sm-el-xs" \
--dataset_name "mozilla-foundation/common_voice_11_0" \
--dataset_config_name "el" \
--audio_column_name "audio" \
--text_column_name "sentence" \
--max_duration_in_seconds 30 \
--train_split_name "train+validation" \
--eval_split_name "test" \
--do_lower_case False \
--do_remove_punctuation False \
--do_normalize_eval True \
--language "greek" \
--task "transcribe" \
--shuffle_buffer_size 500 \
--output_dir "./data/finetuningRuns/whisper-sm-el-xs" \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--learning_rate 1e-5 \
--warmup_steps 500 \
--max_steps 5000 \
--gradient_checkpointing True \
--fp16 True \
--evaluation_strategy "steps" \
--per_device_eval_batch_size 8 \
--predict_with_generate True \
--generation_max_length 225 \
--save_steps 1000 \
--eval_steps 1000 \
--logging_steps 25 \
--report_to "tensorboard" \
--load_best_model_at_end True \
--metric_for_best_model "wer" \
--greater_is_better False \
--push_to_hub False \
--overwrite_output_dir True
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 5000
- mixed_precision_training: Native AMP
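The arithmetic behind two of the values above can be sketched as follows. The total batch size of 64 is the per-device batch size multiplied by the gradient accumulation steps, which implies a single device (an inference from 16 × 4 = 64, not something the log states). The linear schedule ramps the learning rate up over the warmup steps and then decays it to zero at the final step. The function names are hypothetical and the schedule is a plain-Python approximation of a linear scheduler with warmup.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer update."""
    return per_device * accumulation_steps * num_devices


def linear_lr(step: int, base_lr: float = 1e-5, warmup_steps: int = 500,
              total_steps: int = 5000) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / warmup_steps
    # Linear decay from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


print(effective_batch_size(16, 4))        # 64
for step in (0, 250, 500, 2750, 5000):
    print(step, linear_lr(step))
```

Under these settings the learning rate peaks at 1e-05 at step 500 and is half that value at step 2750, midway through the decay.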
Training results
Training Loss | Epoch | Step | Validation Loss | Wer |
---|---|---|---|---|
0.0024 | 18.01 | 1000 | 0.4246 | 21.0438 |
0.0003 | 37.01 | 2000 | 0.4805 | 20.6352 |
0.0001 | 56.01 | 3000 | 0.5102 | 20.8395 |
0.0001 | 75.0 | 4000 | 0.5296 | 21.0717 |
0.0001 | 94.0 | 5000 | 0.5375 | 21.0253 |
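The headline figures at the top of this card match the step-2000 row rather than the final step. That is consistent with the `--load_best_model_at_end True`, `--metric_for_best_model "wer"` and `--greater_is_better False` flags in the command above. The selection can be reproduced from the table with a few lines; the variable names are illustrative.

```python
# (step, validation_loss, wer) rows copied from the table above.
checkpoints = [
    (1000, 0.4246, 21.0438),
    (2000, 0.4805, 20.6352),
    (3000, 0.5102, 20.8395),
    (4000, 0.5296, 21.0717),
    (5000, 0.5375, 21.0253),
]

# Lower WER is better, so the best checkpoint minimizes the third field.
best_step, best_loss, best_wer = min(checkpoints, key=lambda row: row[2])
print(best_step, best_loss, best_wer)
```

Validation loss rises steadily after step 1000 while WER changes little, so selecting by WER and selecting by loss would pick different checkpoints here.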
Here is the summary from the log of the run:
***** train metrics *****
epoch = 94.0
train_loss = 0.0222
train_runtime = 23:06:13.19
train_samples_per_second = 3.847
train_steps_per_second = 0.06
12/08/2022 11:20:17 - INFO - __main__ - *** Evaluate ***
***** eval metrics *****
epoch = 94.0
eval_loss = 0.4805
eval_runtime = 0:23:03.68
eval_samples_per_second = 1.226
eval_steps_per_second = 0.153
eval_wer = 20.6352
Thu 08 Dec 2022 11:43:22 AM EST
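As a consistency check on the log, the throughput figures can be recomputed from the runtimes, the step count, and the total batch size of 64. The sketch below uses only values that appear in this card. The helper `hms_to_seconds` is a hypothetical name, and the evaluation set size is an estimate derived from the logged rate, not a number taken from the dataset.

```python
def hms_to_seconds(text: str) -> float:
    """Convert an "H:MM:SS.ss" runtime string into seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


train_seconds = hms_to_seconds("23:06:13.19")
samples_seen = 5000 * 64  # training steps x total train batch size

train_samples_per_second = samples_seen / train_seconds
train_steps_per_second = 5000 / train_seconds

# Estimated number of evaluation samples implied by the logged rate.
eval_seconds = hms_to_seconds("0:23:03.68")
estimated_eval_samples = 1.226 * eval_seconds

print(round(train_samples_per_second, 3))  # compare with train_samples_per_second
print(round(train_steps_per_second, 2))    # compare with train_steps_per_second
print(round(estimated_eval_samples))
```

The recomputed training rates agree with the logged 3.847 samples per second and 0.06 steps per second, and the evaluation rate implies a test set of roughly 1,700 utterances.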
Framework versions
- Transformers 4.26.0.dev0
- Pytorch 1.13.0
- Datasets 2.7.1.dev0
- Tokenizers 0.12.1
Dataset used to train farsipal/whisper-sm-el-xs
- mozilla-foundation/common_voice_11_0
Evaluation results
- Wer on mozilla-foundation/common_voice_11_0 el test set (self-reported): 20.635