|
--- |
|
base_model: openai/whisper-small |
|
library_name: peft |
|
license: mit |
|
tags: |
|
- whisper-small |
|
- speech_to_text |
|
- ASR |
|
- french |
|
language: |
|
- fr |
|
demo: https://huggingface.co/spaces/visalkao/whisper-small-french-finetuned |
|
--- |
|
|
|
# Model Card for whisper-small-french-finetuned

A LoRA (PEFT) fine-tune of [openai/whisper-small](https://huggingface.co/openai/whisper-small) for French automatic speech recognition (ASR). Fine-tuning reduces the Word Error Rate on the evaluation data from roughly 27% to roughly 17%.
|
|
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This repository contains a PEFT (LoRA) adapter for `openai/whisper-small`, fine-tuned for French speech-to-text on a single-speaker French audiobook dataset.
|
|
|
|
|
|
|
- **Developed by:** Visal KAO
- **Model type:** Speech recognition (ASR)
- **Language(s):** French
- **License:** MIT
- **Finetuned from model:** [openai/whisper-small](https://huggingface.co/openai/whisper-small)
|
|
|
### Model Sources
|
|
|
|
|
|
- **Repository:** https://huggingface.co/openai/whisper-small
- **Demo:** https://huggingface.co/spaces/visalkao/whisper-small-french-finetuned
|
## Dataset |
|
This model is fine-tuned on 50% of the French Single Speaker Speech Dataset on Kaggle (the `lesmis` / *Les Misérables* subset only); a hypothetical sketch of building such a split is shown below.

- **Link to dataset:** https://www.kaggle.com/datasets/bryanpark/french-single-speaker-speech-dataset
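
One way the "50% of `lesmis`" selection could be built; the metadata file name and pipe-separated column layout are assumptions (LJSpeech-style), not confirmed by the dataset card:

```python
# Hypothetical data-selection sketch; file name and columns are assumed.
import pandas as pd

meta = pd.read_csv(
    "transcript.txt", sep="|", header=None,
    names=["audio_path", "text", "text_normalized"],
)
lesmis = meta[meta["audio_path"].str.contains("lesmis")]  # keep Les Misérables only
subset = lesmis.sample(frac=0.5, random_state=42)         # take 50% of that subset
```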
|
|
|
## Uses |
|
|
|
|
The goal of this project is to fine-tune the Whisper-small model to improve its accuracy on French transcription.
|
|
|
I chose Whisper-small for its size and versatility: the primary objective is to fine-tune a small model that still achieves acceptable results.
|
### Direct Use |
|
|
|
**Live Demo:** https://huggingface.co/spaces/visalkao/whisper-small-french-finetuned
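
A minimal local-inference sketch. The adapter repo id below is an assumption inferred from the demo Space name, and `example.wav` is a placeholder path:

```python
# Load the base model, attach the LoRA adapter, and transcribe one file.
import librosa
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(base, "visalkao/whisper-small-french-finetuned")  # assumed repo id
model.eval()

audio, _ = librosa.load("example.wav", sr=16000)  # 16 kHz mono waveform
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
forced_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")
with torch.no_grad():
    ids = model.generate(
        input_features=inputs.input_features,
        forced_decoder_ids=forced_ids,
        max_new_tokens=225,
    )
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```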
|
|
|
## Bias, Risks, and Limitations |
|
|
|
With fewer than 250 million parameters, this model is quite small for a speech-transcription model, so it comes with its own limitations.
|
|
|
The Word Error Rate (WER) of this fine-tuned model is approximately 0.17 (17%).
|
|
|
For reference, the original Whisper-small's WER is around 0.27 (27%) on the same dataset. |
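
WER is the standard ASR metric (word-level edit distance divided by reference length). Here is a self-contained sketch using the `evaluate` library, which is an assumption about the metric implementation rather than something stated in this card:

```python
# Toy WER computation: one deletion ("à") over a 5-word reference -> 0.2.
import evaluate  # requires the jiwer backend: pip install evaluate jiwer

wer_metric = evaluate.load("wer")
predictions = ["bonjour tout le monde"]   # model transcripts
references = ["bonjour à tout le monde"]  # ground-truth transcripts
print(wer_metric.compute(predictions=predictions, references=references))  # 0.2
```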
|
|
|
|
|
|
|
|
|
## Training Hyperparameters |
|
This model was trained using LoRA with the following hyperparameters (a hedged sketch of the corresponding training setup follows the list):
|
|
|
* `per_device_train_batch_size=3`
* `gradient_accumulation_steps=1`
* `learning_rate=1e-3`
* `num_train_epochs=7`
* `evaluation_strategy="epoch"`
* `fp16=True`
* `per_device_eval_batch_size=1`
* `generation_max_length=225`
* `logging_steps=10`
* `remove_unused_columns=False`
* `label_names=["labels"]`
* `predict_with_generate=True`
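
For readability, here is how those values map onto `transformers.Seq2SeqTrainingArguments`, together with a LoRA configuration; the `output_dir` and all `LoraConfig` values (rank, alpha, dropout, target modules) are assumptions, since the card does not state them:

```python
from peft import LoraConfig, get_peft_model
from transformers import Seq2SeqTrainingArguments, WhisperForConditionalGeneration

# LoRA settings are NOT given in this card; these are common Whisper choices.
lora_config = LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(
    WhisperForConditionalGeneration.from_pretrained("openai/whisper-small"),
    lora_config,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-fr-lora",  # placeholder path
    per_device_train_batch_size=3,
    gradient_accumulation_steps=1,
    learning_rate=1e-3,
    num_train_epochs=7,
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers
    fp16=True,
    per_device_eval_batch_size=1,
    generation_max_length=225,
    logging_steps=10,
    remove_unused_columns=False,  # keep columns the PEFT forward pass needs
    label_names=["labels"],
    predict_with_generate=True,
)
```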
|
|
|
|
|
## Results |
|
|
|
Before fine-tuning, the Word Error Rate on this dataset was approximately 0.27 (27%).

After fine-tuning, it drops by about 0.10 to roughly 0.17 (17% WER) on the test data.
|
|
|
Here is the training log: |
|
|
|
| Epoch | Training Loss | Validation Loss | WER (%)   |
|-------|---------------|-----------------|-----------|
| 1     | 0.369600      | 0.404414        | 26.665379 |
| 2     | 0.273200      | 0.361762        | 22.793976 |
| 3     | 0.308800      | 0.344289        | 24.454528 |
| 4     | 0.131600      | 0.318023        | 21.847847 |
| 5     | 0.117400      | 0.311023        | 19.134968 |
| 6     | 0.035700      | 0.301410        | 18.922572 |
| 7     | 0.013900      | 0.315151        | 16.972388 |