This model is a fine-tuned version of openai/whisper-small on the wTIMIT-US dataset using SpecAugment, a time- and frequency-masking data augmentation method.

The model was fine-tuned jointly on normal and whispered speech, using SpecAugment in its LibriSpeech Double (LD) configuration. It serves as a baseline for comparison against phone-aware masking methods such as F0-Mask, F1-Mask, and LF-Mask.
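To make the masking policy concrete, here is a minimal NumPy sketch of SpecAugment-style time and frequency masking. The default parameters follow the LibriSpeech Double (LD) policy from the SpecAugment paper (two frequency masks with F = 27, two time masks with T = 100); time warping is omitted, and the function name and mean-fill choice are illustrative assumptions, not this model's training code.

```python
import numpy as np

def spec_augment_ld(spec, rng=None,
                    freq_mask_width=27, n_freq_masks=2,
                    time_mask_width=100, n_time_masks=2):
    """Apply SpecAugment-style masking to a (freq, time) log-mel spectrogram.

    Defaults mirror the LibriSpeech Double (LD) policy; time warping
    is left out for brevity. Masked regions are filled with the mean.
    """
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape
    fill = spec.mean()

    # Frequency masks: zero out up to `freq_mask_width` consecutive mel bins.
    for _ in range(n_freq_masks):
        f = int(rng.integers(0, freq_mask_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - f + 1)))
        spec[f0:f0 + f, :] = fill

    # Time masks: zero out up to `time_mask_width` consecutive frames.
    max_t = min(time_mask_width, n_time)
    for _ in range(n_time_masks):
        t = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, max(1, n_time - t + 1)))
        spec[:, t0:t0 + t] = fill

    return spec
```

For example, `spec_augment_ld(log_mel)` on an 80 x 300 log-mel spectrogram returns a masked copy of the same shape, leaving the input untouched.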

## Evaluation Results on wTIMIT-US (Test Set)

| Setup       | Training Data | Augmentation     | WER (Normal) | WER (Whispered) |
|-------------|---------------|------------------|--------------|-----------------|
| Baseline    | Both modes    | None             | 5.8          | 11.7            |
| SpecAugment | Both modes    | SpecAugment (LD) | 5.2          | 12.3            |

SpecAugment significantly reduced WER on normal speech relative to the unaugmented baseline (5.2 vs. 5.8, p = 0.014), while the difference on whispered speech (12.3 vs. 11.7) was not statistically significant (p = 0.147).
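The WER figures above were scored with SCTK's sclite (see the notes below); as an illustration of what the metric measures, here is a minimal word-level Levenshtein implementation. The function name is a hypothetical stand-in, not part of the evaluation toolkit, and it omits sclite's alignment reports and MAPSSWE significance testing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(1, len(ref))
```

For instance, one substitution plus one deletion against a four-word reference gives a WER of 0.5.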

## Cite as

Kokowski, J. (2025). *F0-Based Masking Policies for Self-Supervised Whispered Speech Recognition*. Master's Thesis, University of Groningen, Campus Fryslân.
Available at: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/674

If you use this model or build upon this work, please cite the thesis above.

- Model: Whisper-small
- Augmentation: SpecAugment (LD)
- Evaluation toolkit: SCTK (sclite)
- Notes: For statistical comparisons and MAPSSWE evaluation, see Section 5 of the thesis.

