This model is a fine-tuned version of openai/whisper-small
on the wTIMIT-US dataset using SpecAugment, a time- and frequency-masking data augmentation method.
The model was fine-tuned jointly on normal and whispered speech, using SpecAugment in its LibriSpeech Double (LD) configuration. It serves as a baseline for comparison against phone-aware masking methods such as F0-Mask, F1-Mask, and LF-Mask.
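In the Hugging Face Transformers implementation of Whisper, SpecAugment-style time and frequency masking can be switched on through the model config. The sketch below illustrates the mechanism only; the parameter values are the library's probabilistic masking knobs with illustrative settings, not the exact LibriSpeech Double (LD) policy used in the thesis.

```python
# Illustrative sketch: enabling SpecAugment-style masking when fine-tuning
# Whisper-small with Hugging Face Transformers. The values below are NOT the
# exact LD settings from the thesis; they only show where the masking is
# configured.
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small",
    apply_spec_augment=True,   # turn on masking of the log-mel input features
    mask_time_prob=0.05,       # fraction of time steps picked as mask starts
    mask_time_length=10,       # length of each time mask, in frames
    mask_feature_prob=0.05,    # fraction of mel channels picked as mask starts
    mask_feature_length=10,    # width of each frequency mask, in mel bins
)
# Masking is applied only while the model is in training mode.
```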
Evaluation Results on wTIMIT-US (Test Set)
| Setup | Training Data | Augmentation | WER (Normal, %) | WER (Whispered, %) |
|---|---|---|---|---|
| Baseline | Both modes | None | 5.8 | 11.7 |
| SpecAugment | Both modes | SpecAugment (LD) | 5.2 | 12.3 |
SpecAugment significantly reduced WER on normal speech relative to the unaugmented baseline (p = 0.014), while the difference on whispered speech was not statistically significant (p = 0.147).
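The scores above were produced with SCTK's sclite (see the notes below). For a quick, unofficial sanity check outside that toolkit, WER can also be approximated with the Hugging Face `evaluate` library; the transcripts in this sketch are placeholders.

```python
# Unofficial WER sanity check; the thesis results were scored with sclite, so
# numbers from this snippet may differ slightly (e.g. text normalisation).
import evaluate

wer_metric = evaluate.load("wer")
references = ["example reference transcript"]   # placeholder ground truth
predictions = ["example reference transcript"]  # placeholder model output
print(100 * wer_metric.compute(references=references, predictions=predictions))
```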
Cite as
Kokowski, J. (2025). F0-Based Masking Policies for Self-Supervised Whispered Speech Recognition. Master's Thesis, University of Groningen, Campus Fryslân.
Available at: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/674
If you use this model or build upon this work, please cite the thesis above.
Model: Whisper-small
Augmentation: SpecAugment (LD)
Evaluation toolkit: SCTK (sclite)
Notes: For statistical comparisons and MAPSSWE evaluation, see Section 5 of the thesis.
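For transcription, the checkpoint can be loaded like any other Whisper model. A minimal sketch, assuming the repository id jankoko/SpecAugment-Whisper-small and a 16 kHz mono audio file:

```python
# Minimal inference sketch; "audio.wav" is a placeholder path.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="jankoko/SpecAugment-Whisper-small",
)
print(asr("audio.wav")["text"])
```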
Related Models
- SpecAugment (this model)
- F0-Mask Version
- F1-Mask Version
- LF-Mask Version