This model is a fine-tuned version of openai/whisper-small on the wTIMIT-US dataset using the F0-Mask augmentation method. It was evaluated on both normal and whispered speech subsets, with Word Error Rate (WER) as the primary metric.

The results below highlight performance improvements over the baseline for whispered speech, validating the effectiveness of phoneme-aware low-frequency masking (PALF-Mask).
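To make the idea concrete, here is a minimal, illustrative sketch of low-frequency masking on a magnitude spectrogram. It is not the exact PALF-Mask policy from the thesis (which is phoneme-aware); the function name, bin-mapping, and cutoff rule are assumptions for demonstration only.

```python
import numpy as np

def f0_band_mask(spec: np.ndarray, f0_hz: float, sr: int = 16000) -> np.ndarray:
    """Illustrative low-frequency mask (NOT the thesis's exact policy).

    Zeros out spectrogram bins below an F0-derived cutoff.
    spec: (freq_bins, time) magnitude spectrogram with bins spanning 0..sr/2.
    f0_hz: estimated fundamental frequency for the utterance/frame.
    """
    n_bins = spec.shape[0]
    nyquist = sr / 2
    # Map the F0 cutoff frequency onto a bin index (linear frequency axis assumed).
    cutoff_bin = int(round(f0_hz / nyquist * n_bins))
    masked = spec.copy()
    masked[:cutoff_bin, :] = 0.0  # suppress the low-frequency band up to F0
    return masked

# Example: mask everything below 200 Hz in a dummy 257-bin spectrogram.
spec = np.ones((257, 10))
masked = f0_band_mask(spec, f0_hz=200.0)
```

In practice the masked features would be fed to the model during fine-tuning, analogous to how SpecAugment applies frequency masks, but with the band anchored to F0 rather than chosen at random.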

Evaluation Results on wTIMIT-US (Test Set)

| Setup | Training Data | Augmentation | WER (Normal) | WER (Whispered) |
|---|---|---|---|---|
| No Fine-tuning | Zero-shot | None | 5.0 | 13.7 |
| Baseline | Both modes | None | 5.8 | 11.7 |
| SpecAugment | Both modes | SpecAugment (LD) | 5.2 | 12.3 |
| F0-Mask (Ours) | Both modes | F0-based Masking | 5.0 (ns, p=0.144) | 11.5 (β˜…, p=0.002) |

β˜… = statistically significant improvement over SpecAugment (paired MAPSSWE test)
ns = not statistically significant

Compared to the SpecAugment baseline, F0-Mask achieved a statistically significant improvement on whispered speech (0.8 points lower absolute WER, p=0.002), while maintaining comparable performance on normal speech (p=0.144).

Notably, the whispered WER of 11.5% matches the best result previously reported on this dataset by Marchenko (2024).
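For reference, WER is the word-level Levenshtein distance (substitutions + deletions + insertions) divided by the number of reference words. The evaluation above uses sclite; the following is only a minimal self-contained sketch of the metric itself, not the SCTK implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over word sequences via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
# one deletion over six reference words
```

A WER of 11.5 in the table above thus means roughly 11.5 word edits per 100 reference words on the whispered test set.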

Cite as

Kokowski, J. (2025). F0-Based Masking Policies for Self-Supervised Whispered Speech Recognition. Master’s Thesis, University of Groningen, Campus FryslΓ’n.
Available at: https://campus-fryslan.studenttheses.ub.rug.nl/view/degree_programme/voice_technology.html

If you use this model or build upon this work, please cite the thesis above.

Model: Whisper-small
Augmentation: F0-Mask
Evaluation toolkit: SCTK (sclite)
Notes: For complete results, including MAPSSWE and CER scores, refer to Section 5 of the thesis.
