This model is a fine-tuned version of openai/whisper-small on the wTIMIT-US dataset using SpecAugment, a time- and frequency-masking data augmentation method.

The model was fine-tuned jointly on normal and whispered speech, using SpecAugment in its LibriSpeech Double (LD) configuration. It serves as a baseline for comparison against phone-aware masking methods such as F0-Mask, F1-Mask, and LF-Mask.
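To make the masking policy concrete, here is a minimal NumPy sketch of SpecAugment-style time and frequency masking. The default parameters follow the LibriSpeech Double (LD) policy from the SpecAugment paper (two frequency masks with F = 27, two time masks with T = 100); time warping is omitted, and the function name and mean-fill choice are illustrative assumptions, not this model's training code.

```python
import numpy as np

def spec_augment_ld(spec, rng=None,
                    freq_mask_width=27, n_freq_masks=2,
                    time_mask_width=100, n_time_masks=2):
    """Apply SpecAugment-style masking to a (freq, time) log-mel spectrogram.

    Defaults mirror the LibriSpeech Double (LD) policy; time warping
    is left out for brevity. Masked regions are filled with the mean.
    """
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape
    fill = spec.mean()

    # Frequency masks: zero out up to `freq_mask_width` consecutive mel bins.
    for _ in range(n_freq_masks):
        f = int(rng.integers(0, freq_mask_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - f + 1)))
        spec[f0:f0 + f, :] = fill

    # Time masks: zero out up to `time_mask_width` consecutive frames.
    max_t = min(time_mask_width, n_time)
    for _ in range(n_time_masks):
        t = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, max(1, n_time - t + 1)))
        spec[:, t0:t0 + t] = fill

    return spec
```

For example, `spec_augment_ld(log_mel)` on an 80 x 300 log-mel spectrogram returns a masked copy of the same shape, leaving the input untouched.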

## Evaluation Results on wTIMIT-US (Test Set)

| Setup       | Training Data | Augmentation     | WER (Normal) | WER (Whispered) |
|-------------|---------------|------------------|--------------|-----------------|
| Baseline    | Both modes    | None             | 5.8          | 11.7            |
| SpecAugment | Both modes    | SpecAugment (LD) | 5.2          | 12.3            |

SpecAugment significantly reduced WER on normal speech relative to the unaugmented baseline (5.2 vs. 5.8, p = 0.014), while the difference on whispered speech (12.3 vs. 11.7) was not statistically significant (p = 0.147).
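The WER figures above were scored with SCTK's sclite (see the notes below); as an illustration of what the metric measures, here is a minimal word-level Levenshtein implementation. The function name is a hypothetical stand-in, not part of the evaluation toolkit, and it omits sclite's alignment reports and MAPSSWE significance testing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(1, len(ref))
```

For instance, one substitution plus one deletion against a four-word reference gives a WER of 0.5.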

## Cite as

Kokowski, J. (2025). *F0-Based Masking Policies for Self-Supervised Whispered Speech Recognition*. Master's Thesis, University of Groningen, Campus Fryslân.
Available at: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/674

If you use this model or build upon this work, please cite the thesis above.

- Model: Whisper-small
- Augmentation: SpecAugment (LD)
- Evaluation toolkit: SCTK (sclite)
- Notes: For statistical comparisons and MAPSSWE evaluation, see Section 5 of the thesis.

