metadata

language:
  - mn
base_model: openai/whisper-medium
library_name: transformers
datasets:
  - mozilla-foundation/common_voice_17_0
  - google/fleurs
tags:
  - audio
  - automatic-speech-recognition
widget:
  - example_title: Common Voice sample 1
    src: sample1.flac
  - example_title: Common Voice sample 2
    src: sample2.flac
model-index:
  - name: whisper-medium-mn
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 17.0
          type: common_voice_17_0
          config: mn
          split: test
          args:
            language: mn
        metrics:
          - name: Test WER
            type: wer
            value: 12.958
pipeline_tag: automatic-speech-recognition
license: apache-2.0

Whisper Medium Mn - Erkhembayar Gantulga

This model is a fine-tuned version of openai/whisper-medium for Mongolian (mn) automatic speech recognition, trained on the Common Voice 17.0 and Google FLEURS datasets. It achieves the following results on the evaluation set:

Loss: 0.1083
WER: 12.9580
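
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python illustration and assumes the score above is reported in percent. The function name `word_error_rate` and the example sentences are illustrative only; the card does not state which implementation or text normalization produced the reported score.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference word count * 100."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four reference words.
example_wer = word_error_rate("би ном уншиж байна", "би ном уншиж байсан")
print(example_wer)
```

Corpus-level scores are usually computed from the total number of errors over the total number of reference words, which is not the same as averaging per-utterance values.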

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

The training set combines the Mongolian subsets of Common Voice 17.0 and Google FLEURS. The test splits of both datasets are concatenated in the same way for evaluation:

from datasets import Audio, DatasetDict, concatenate_datasets, load_dataset

# Common Voice 17.0, Mongolian ("mn"). token=True replaces the deprecated use_auth_token=True.
common_voice = DatasetDict()

common_voice["train"] = load_dataset("mozilla-foundation/common_voice_17_0", "mn", split="train+validation+validated", token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_17_0", "mn", split="test", token=True)

# Resample the audio to 16 kHz, the sampling rate Whisper expects.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

# Drop metadata columns that are not needed for training.
common_voice = common_voice.remove_columns(
    ["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes", "variant"]
)

# Google FLEURS, Mongolian ("mn_mn").
google_fleurs = DatasetDict()

google_fleurs["train"] = load_dataset("google/fleurs", "mn_mn", split="train+validation", token=True)
google_fleurs["test"] = load_dataset("google/fleurs", "mn_mn", split="test", token=True)

google_fleurs = google_fleurs.remove_columns(
    ["id", "num_samples", "path", "raw_transcription", "gender", "lang_id", "language", "lang_group_id"]
)
# Rename the transcript column to match Common Voice so the two datasets can be concatenated.
google_fleurs = google_fleurs.rename_column("transcription", "sentence")

# Merge the two corpora split by split.
dataset = DatasetDict()
dataset["train"] = concatenate_datasets([common_voice["train"], google_fleurs["train"]])
dataset["test"] = concatenate_datasets([common_voice["test"], google_fleurs["test"]])

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 16
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 100
training_steps: 4000
mixed_precision_training: Native AMP
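
To make the scheduler settings above concrete, the following sketch computes the learning rate at a given step for a linear schedule with warmup: a linear ramp from 0 to the peak rate over the warmup steps, then a linear decay to 0 at the final step. It is a standalone illustration, not the training code; the function name `linear_schedule_lr` is hypothetical, and the default values are taken from the hyperparameter list.

```python
def linear_schedule_lr(step: int,
                       base_lr: float = 1e-5,
                       warmup_steps: int = 100,
                       total_steps: int = 4000) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Print the rate at a few points of the 4000-step run.
for step in (0, 50, 100, 2050, 4000):
    print(step, linear_schedule_lr(step))
```

With these settings the peak rate of 1e-05 is reached at step 100 and has halved by step 2050.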

Training results

Training Loss	Epoch	Step	Validation Loss	WER (%)
0.2986	0.4912	500	0.3557	40.1515
0.2012	0.9823	1000	0.2310	28.3512
0.099	1.4735	1500	0.1864	23.4453
0.0733	1.9646	2000	0.1405	18.3024
0.0231	2.4558	2500	0.1308	16.5645
0.0191	2.9470	3000	0.1155	14.5569
0.0059	3.4381	3500	0.1122	13.4728
0.006	3.9293	4000	0.1083	12.9580
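
The table can also be used for a rough sanity check of the run size. The sketch below derives the number of optimizer steps per epoch from the Step and Epoch columns, estimates the training-set size, and computes the relative WER reduction between the first and last checkpoints. The size estimate assumes that train_batch_size 16 is the effective batch size (single device, no gradient accumulation), which the card does not state; the variable names are illustrative.

```python
# (step, epoch, WER) values copied from the first and last rows of the table.
first_step, first_epoch, first_wer = 500, 0.4912, 40.1515
last_step, last_epoch, last_wer = 4000, 3.9293, 12.9580

# Assumption: 16 is the effective batch size (one device, no gradient accumulation).
TRAIN_BATCH_SIZE = 16

# Optimizer steps needed for one pass over the training set.
steps_per_epoch = round(last_step / last_epoch)

# Approximate number of training examples under the assumption above.
approx_train_examples = steps_per_epoch * TRAIN_BATCH_SIZE

# Relative WER reduction between step 500 and step 4000, in percent.
relative_wer_reduction = 100.0 * (first_wer - last_wer) / first_wer

print(steps_per_epoch, approx_train_examples, round(relative_wer_reduction, 1))
```

Under that assumption the run covers just under four epochs of roughly 16,000 training examples.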

Framework versions

Transformers 4.44.0
PyTorch 2.3.1+cu121
Datasets 2.21.0
Tokenizers 0.19.1