---
language: ar
license: mit
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- whisper
- speechbrain
- arabic
- egyptian-arabic
- speech-to-text
- asr
datasets:
- MAdel121/arabic-egy-cleaned
metrics:
- wer
- cer
model-index:
- name: whisper-small-egyptian-arabic
results:
- task:
type: automatic-speech-recognition
name: Speech Recognition
dataset:
name: MAdel121/arabic-egy-cleaned (Test Split)
type: MAdel121/arabic-egy-cleaned
config: default # Assuming default config
split: test
metrics:
- type: wer
value: 22.687923567389824
name: Test WER
- type: cer
value: 16.69961390157474
name: Test CER
---
# Whisper Small - Fine-tuned for Egyptian Arabic ASR
This repository contains a fine-tuned version of the `openai/whisper-small` model for Automatic Speech Recognition (ASR) specifically targeting the **Egyptian Arabic dialect**.
The model was fine-tuned using the [SpeechBrain](https://github.com/speechbrain/speechbrain) toolkit on the `MAdel121/arabic-egy-cleaned` dataset.
## Model Description
* **Base Model:** [openai/whisper-small](https://huggingface.co/openai/whisper-small)
* **Language:** Arabic (`ar`)
* **Task:** Transcription
* **Fine-tuning Framework:** SpeechBrain
* **Dataset:** [MAdel121/arabic-egy-cleaned](https://huggingface.co/datasets/MAdel121/arabic-egy-cleaned)
## Intended Uses & Limitations
This model is intended for transcribing speech in the **Egyptian Arabic dialect**.
**Limitations:**
* Performance may degrade significantly on other Arabic dialects.
* Performance on noisy audio may vary, as only specific augmentations (DropChunk, DropFreq, DropBitResolution) were used during training.
* The model might perform less effectively on highly specialized domains or topics not present in the fine-tuning dataset.
## How to Use
You can use this model directly with the `transformers` library pipeline for automatic speech recognition. Ensure you have `transformers` and `torch` installed (`pip install transformers torch`).
```python
from transformers import pipeline
import torch
# The pipeline decodes audio with ffmpeg, so make sure the ffmpeg binary is
# installed (e.g. via apt, brew, or conda); a Python wrapper alone is not enough.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Replace "your-username/whisper-small-egyptian-arabic" with the actual model ID on the Hub
pipe = pipeline(
"automatic-speech-recognition",
model="your-username/whisper-small-egyptian-arabic", # <<< Replace this
device=device
)
# Load your audio file (requires ffmpeg)
# For local files:
audio_file = "/path/to/your/egyptian_arabic_audio.wav"
result = pipe(audio_file, chunk_length_s=30, batch_size=8) # Adjust batch_size based on GPU memory
# For datasets library audio:
# from datasets import load_dataset
# ds = load_dataset("MAdel121/arabic-egy-cleaned", "default", split="test") # Example
# sample = ds[0]["audio"]
# result = pipe(sample.copy()) # Pass a copy to avoid modifying original
print(result["text"])
```

Alternatively, you can load the processor and model directly with `WhisperProcessor` and `WhisperForConditionalGeneration`:

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

device = "cuda:0" if torch.cuda.is_available() else "cpu"
audio_file = "/path/to/your/egyptian_arabic_audio.wav"

# Load the processor and model (replace with the actual model ID on the Hub)
model_id = "your-username/whisper-small-egyptian-arabic"  # <<< Replace this
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)
# Load and preprocess audio
waveform, sample_rate = torchaudio.load(audio_file)
if sample_rate != processor.feature_extractor.sampling_rate:
resampler = torchaudio.transforms.Resample(sample_rate, processor.feature_extractor.sampling_rate)
waveform = resampler(waveform)
input_features = processor(
    waveform.squeeze().numpy(),
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
).input_features.to(device)
# Generate transcription
# Set forced_decoder_ids for Arabic transcription
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ar", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
**Note:** The original checkpoint was saved using SpeechBrain. This README assumes the model has been converted to the standard Hugging Face Transformers format for hosting and usage with the `pipeline` or `AutoModel` classes. If you are using the original `.ckpt` file, refer to the project's main `README.md` and the `infer_whisper_local.py` script for loading instructions.
## Training Data
The model was fine-tuned on the **`MAdel121/arabic-egy-cleaned`** dataset available on the Hugging Face Hub. This dataset contains cleaned audio samples and corresponding transcriptions in Egyptian Arabic.
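You can inspect the dataset yourself with the `datasets` library. A minimal sketch follows; the transcription column name `text` is an assumption, so check the printed features for the actual field name:
```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
ds = load_dataset("MAdel121/arabic-egy-cleaned")
print(ds)  # shows the available splits and feature names

sample = ds["test"][0]
print(sample["audio"]["sampling_rate"])
print(sample["text"])  # "text" is assumed -- verify against the printed features
```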
## Training Procedure
* **Framework:** SpeechBrain (`speechbrain==1.0.3`) with Hugging Face Transformers (`transformers==4.51.3`) and Accelerate (`accelerate==0.25.0`).
* **Base Model:** `openai/whisper-small`
* **Dataset:** `MAdel121/arabic-egy-cleaned`
* **Epochs:** 10
* **Optimizer:** AdamW (`lr=1e-5`, `weight_decay=0.05`)
* **LR Scheduler:** NewBob (`improvement_threshold=0.0025`, `annealing_factor=0.9`, `patient=0`)
* **Warmup Steps:** 1000
* **Batch Size:** 8 (fixed, no dynamic batching)
* **Gradient Accumulation:** 2 steps (effective batch size: 16)
* **Gradient Clipping:** Max norm 5.0
* **Mixed Precision:** Not explicitly configured; training is assumed to have run in FP32 unless handled automatically by Accelerate.
* **Augmentation:** Enabled (`augment_prob_master=0.5`, `min_augmentations=1`, `max_augmentations=3`) with the following techniques applied randomly from the pool:
* DropChunk (`length: 1600-4800 samples`, `count: 1-5`)
* DropFreq (`count: 1-3`)
* DropBitResolution
* **Training Environment:** Modal Labs (`gpu=A100-40GB`)
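For orientation, the optimizer, gradient-accumulation, and clipping settings above translate roughly into the following plain-PyTorch loop. This is an illustrative sketch of the listed hyperparameters only, not the actual SpeechBrain recipe; the NewBob scheduler, warmup, and augmentation pipeline are omitted, and `model`, `train_loader`, and `compute_loss` are placeholders:

```python
import torch

# model: the Whisper model being fine-tuned; train_loader: your DataLoader
# (placeholders -- the real run used SpeechBrain's training loop)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.05)
accumulation_steps = 2  # batch size 8 x 2 accumulation = effective batch 16

for step, batch in enumerate(train_loader):
    loss = compute_loss(model, batch)  # placeholder for the ASR loss
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        # Clip gradients at max norm 5.0 before each optimizer step
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
        optimizer.zero_grad()
```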
## Evaluation Results
The model was evaluated on the **test split** of the `MAdel121/arabic-egy-cleaned` dataset.
| Metric | Value (%) |
| :----- | :-------- |
| WER | 22.69 |
| CER | 16.70 |
*WER (Word Error Rate) and CER (Character Error Rate) are reported. Lower is better.*
Validation metrics at the end of training (Epoch 10):
* Validation WER: 22.79%
* Validation CER: 16.76%
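To score your own transcriptions against references, WER and CER can be computed with the `jiwer` package (`pip install jiwer`); this is one common choice, as the exact metric implementation used during training is not specified here. A minimal sketch with placeholder strings:

```python
import jiwer

# Placeholder pairs -- substitute the test-split references and model outputs
references = ["النص المرجعي الأول", "النص المرجعي الثاني"]
hypotheses = ["النص المتوقع الأول", "النص المتوقع الثاني"]

print(f"WER: {jiwer.wer(references, hypotheses) * 100:.2f}%")
print(f"CER: {jiwer.cer(references, hypotheses) * 100:.2f}%")
```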
## Citation
If you use this model, please consider citing the original Whisper paper and the dataset used:
```bibtex
@article{radford2023robust,
title={Robust Speech Recognition via Large-Scale Weak Supervision},
author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
journal={arXiv preprint arXiv:2212.04356},
year={2023}
}
@misc{adel_mohamed_2024_12860997,
author = {Adel Mohamed},
title = {MAdel121/arabic-egy-cleaned},
month = jun,
year = 2024,
publisher = {Zenodo},
doi = {10.5281/zenodo.12860997},
url = {https://doi.org/10.5281/zenodo.12860997}
}
@misc{speechbrain,
title={{SpeechBrain}: A General-Purpose Speech Toolkit},
author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
year={2021},
eprint={2106.04624},
archivePrefix={arXiv},
primaryClass={eess.AS},
note={arXiv:2106.04624}
}
```
## Model Card Authors
[Your Name/Organization Here]
*(Based on training run `ceeu3g6c`)*