---
language: ar
license: mit
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- whisper
- speechbrain
- arabic
- egyptian-arabic
- speech-to-text
- asr
datasets:
- MAdel121/arabic-egy-cleaned
metrics:
- wer
- cer
model-index:
- name: whisper-small-egyptian-arabic
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: MAdel121/arabic-egy-cleaned (Test Split)
      type: MAdel121/arabic-egy-cleaned
      config: default # Assuming default config
      split: test
    metrics:
    - type: wer
      value: 22.687923567389824
      name: Test WER
    - type: cer
      value: 16.69961390157474
      name: Test CER
---

# Whisper Small - Fine-tuned for Egyptian Arabic ASR

This repository contains a fine-tuned version of the `openai/whisper-small` model for Automatic Speech Recognition (ASR) specifically targeting the **Egyptian Arabic dialect**.

The model was fine-tuned using the [SpeechBrain](https://github.com/speechbrain/speechbrain) toolkit on the `MAdel121/arabic-egy-cleaned` dataset.

## Model Description

* **Base Model:** [openai/whisper-small](https://huggingface.co/openai/whisper-small)
* **Language:** Arabic (`ar`)
* **Task:** Transcription
* **Fine-tuning Framework:** SpeechBrain
* **Dataset:** [MAdel121/arabic-egy-cleaned](https://huggingface.co/datasets/MAdel121/arabic-egy-cleaned)

## Intended Uses & Limitations

This model is intended for transcribing speech in the **Egyptian Arabic dialect**.

**Limitations:**

* Performance may degrade significantly on other Arabic dialects.
* Performance on noisy audio may vary, as only specific augmentations (DropChunk, DropFreq, DropBitResolution) were used during training.
* The model might perform less effectively on highly specialized domains or topics not present in the fine-tuning dataset.

## How to Use

You can use this model directly with the `transformers` library pipeline for automatic speech recognition. Ensure you have `transformers` and `torch` installed (`pip install transformers torch`).

```python
from transformers import pipeline
import torch

# The pipeline relies on the ffmpeg binary for audio decoding.
# Install it via your system package manager (e.g. apt-get install ffmpeg, brew install ffmpeg).

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Replace "your-username/whisper-small-egyptian-arabic" with the actual model ID on the Hub
pipe = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-small-egyptian-arabic",  # <<< Replace this
    device=device,
)

# Transcribe a local audio file (requires ffmpeg)
audio_file = "/path/to/your/egyptian_arabic_audio.wav"
result = pipe(audio_file, chunk_length_s=30, batch_size=8)  # Adjust batch_size based on GPU memory

# For audio loaded with the datasets library:
# from datasets import load_dataset
# ds = load_dataset("MAdel121/arabic-egy-cleaned", split="test")  # Example
# sample = ds[0]["audio"]
# result = pipe(sample.copy())  # Pass a copy to avoid modifying the original

print(result["text"])

# --- Using WhisperProcessor and WhisperForConditionalGeneration directly ---
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio

# Load the processor and model (replace with your model ID)
model_id = "your-username/whisper-small-egyptian-arabic"  # <<< Replace with the model ID on the Hub
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Load the audio and resample it to the model's expected sampling rate (16 kHz)
waveform, sample_rate = torchaudio.load(audio_file)
if sample_rate != processor.feature_extractor.sampling_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, processor.feature_extractor.sampling_rate)
    waveform = resampler(waveform)

input_features = processor(
    waveform.squeeze().numpy(),
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
).input_features.to(device)

# Generate an Arabic transcription (the language/task kwargs replace the
# deprecated forced_decoder_ids approach in recent transformers versions)
predicted_ids = model.generate(input_features, language="ar", task="transcribe")

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```

**Note:** The original checkpoint was saved with SpeechBrain. This README assumes the model has been converted to the standard Hugging Face Transformers format for hosting and use with the `pipeline` or `AutoModel` classes. If you are using the original `.ckpt` file, refer to the project's main `README.md` and the `infer_whisper_local.py` script for loading instructions.

## Training Data

The model was fine-tuned on the **`MAdel121/arabic-egy-cleaned`** dataset available on the Hugging Face Hub. This dataset contains cleaned audio samples and corresponding transcriptions in Egyptian Arabic.

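For reference, the dataset can be loaded and resampled to Whisper's 16 kHz input rate with the `datasets` library. This is a minimal sketch; the `audio` and `text` column names are assumptions about the dataset schema:

```python
from datasets import load_dataset, Audio

# Load the test split and cast the audio column to 16 kHz (Whisper's input rate)
ds = load_dataset("MAdel121/arabic-egy-cleaned", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

sample = ds[0]
print(sample["audio"]["array"].shape, sample["audio"]["sampling_rate"])
print(sample["text"])  # Reference transcription; column name assumed, check ds.column_names
```
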
## Training Procedure

* **Framework:** SpeechBrain (`speechbrain==1.0.3`) with Hugging Face Transformers (`transformers==4.51.3`) and Accelerate (`accelerate==0.25.0`).
* **Base Model:** `openai/whisper-small`
* **Dataset:** `MAdel121/arabic-egy-cleaned`
* **Epochs:** 10
* **Optimizer:** AdamW (`lr=1e-5`, `weight_decay=0.05`)
* **LR Scheduler:** NewBob (`improvement_threshold=0.0025`, `annealing_factor=0.9`, `patient=0`)
* **Warmup Steps:** 1000
* **Batch Size:** 8 (fixed, no dynamic batching)
* **Gradient Accumulation:** 2 steps (effective batch size: 16)
* **Gradient Clipping:** Max norm 5.0
* **Mixed Precision:** Not explicitly configured; training is assumed to have run in FP32 unless Accelerate enabled mixed precision.
* **Augmentation:** Enabled (`augment_prob_master=0.5`, `min_augmentations=1`, `max_augmentations=3`) with the following techniques applied randomly from the pool (see the sketch after this list):
  * DropChunk (`length: 1600-4800 samples`, `count: 1-5`)
  * DropFreq (`count: 1-3`)
  * DropBitResolution
* **Training Environment:** Modal Labs (`gpu=A100-40GB`)

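The augmentation pool corresponds to SpeechBrain's time-domain augmentations. Below is a minimal sketch of an equivalent `Augmenter` configuration in SpeechBrain 1.0 that mirrors the hyperparameters reported above; it is an illustration, not the exact training recipe:

```python
import torch
from speechbrain.augment.augmenter import Augmenter
from speechbrain.augment.time_domain import DropBitResolution, DropChunk, DropFreq

# Pool of augmentations; 1-3 are applied per batch with probability 0.5
augmenter = Augmenter(
    augment_prob=0.5,  # corresponds to augment_prob_master above
    min_augmentations=1,
    max_augmentations=3,
    augmentations=[
        DropChunk(drop_length_low=1600, drop_length_high=4800,
                  drop_count_low=1, drop_count_high=5),
        DropFreq(drop_freq_count_low=1, drop_freq_count_high=3),
        DropBitResolution(),
    ],
)

# Apply to a batch of waveforms (batch, time) with relative lengths in [0, 1]
waveforms = torch.randn(4, 16000)
lengths = torch.ones(4)
augmented, aug_lengths = augmenter(waveforms, lengths)
print(augmented.shape)
```
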
146
+ ## Evaluation Results
147
+
148
+ The model was evaluated on the **test split** of the `MAdel121/arabic-egy-cleaned` dataset.
149
+
150
+ | Metric | Value (%) |
151
+ | :----- | :-------- |
152
+ | WER | 22.69 |
153
+ | CER | 16.70 |
154
+
155
+ *WER (Word Error Rate) and CER (Character Error Rate) are reported. Lower is better.*
156
+
157
+ Validation metrics at the end of training (Epoch 10):
158
+ * Validation WER: 22.79%
159
+ * Validation CER: 16.76%
160
+
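The reported metrics can be recomputed with the `jiwer` package (`pip install jiwer`). A minimal sketch, reusing the `pipe` object from the usage example and the dataset as loaded in the Training Data section (the `text` reference column is an assumption):

```python
import jiwer

# Transcribe the test split and score against the reference transcriptions
references, hypotheses = [], []
for sample in ds:  # `ds`: the test split loaded in the Training Data example
    result = pipe(sample["audio"].copy(), chunk_length_s=30)
    references.append(sample["text"])
    hypotheses.append(result["text"])

print(f"WER: {100 * jiwer.wer(references, hypotheses):.2f}%")
print(f"CER: {100 * jiwer.cer(references, hypotheses):.2f}%")
```
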
## Citation

If you use this model, please consider citing the original Whisper paper, the dataset, and SpeechBrain:

```bibtex
@article{radford2023robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2023}
}

@misc{adel_mohamed_2024_12860997,
  author    = {Adel Mohamed},
  title     = {MAdel121/arabic-egy-cleaned},
  month     = jun,
  year      = 2024,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.12860997},
  url       = {https://doi.org/10.5281/zenodo.12860997}
}

@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```

## Model Card Authors

[Your Name/Organization Here]

*(Based on training run `ceeu3g6c`)*