This model is a fine-tuned version of openai/whisper-small on the wTIMIT-US dataset using the LF-Mask (Low-Frequency Masking) augmentation method. LF-Mask applies frequency masking below 1.5 kHz for voiced phonemes, motivated by findings from acoustic studies of whispered speech.
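
The masking idea can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes a linear-frequency STFT magnitude spectrogram (Whisper itself consumes log-mel features), a per-frame `voiced` flag that would come from some phoneme alignment, and that the whole band below the cutoff is replaced by a constant. The function name `low_frequency_mask` and all of its parameters are hypothetical.

```python
def low_frequency_mask(spectrogram, voiced, sample_rate=16000, n_fft=400,
                       cutoff_hz=1500.0, fill=0.0):
    """Return a copy of `spectrogram` with bins below `cutoff_hz` masked on voiced frames.

    spectrogram: list of frames, each a list of n_fft // 2 + 1 magnitudes
    voiced:      list of booleans, one per frame (assumed to come from an alignment)
    """
    if len(spectrogram) != len(voiced):
        raise ValueError("need exactly one voiced flag per frame")
    hz_per_bin = sample_rate / n_fft  # spacing of linear STFT bins in Hz
    masked = []
    for frame, is_voiced in zip(spectrogram, voiced):
        new_frame = list(frame)  # copy so the input is left untouched
        if is_voiced:
            for k in range(len(new_frame)):
                if k * hz_per_bin < cutoff_hz:
                    new_frame[k] = fill
        masked.append(new_frame)
    return masked


# Toy input: two frames of 201 bins, the first voiced, the second unvoiced.
spec = [[1.0] * 201, [1.0] * 201]
out = low_frequency_mask(spec, voiced=[True, False])
print(sum(v == 0.0 for v in out[0]))  # count of masked bins in the voiced frame
```

With a 16 kHz sample rate and `n_fft=400`, bins are 40 Hz apart, so only the lowest bins of the voiced frame are affected and the unvoiced frame passes through unchanged.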

It was evaluated on both the normal and whispered speech subsets using Word Error Rate (WER) as the primary metric. LF-Mask performed comparably to SpecAugment but did not produce statistically significant improvements over it in either speaking mode.
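
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a plain dynamic-programming version for illustration only. The reported scores were produced with SCTK (sclite), whose text normalization and alignment details are not reproduced here, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Toy example: one of five reference words is dropped by the recognizer.
print(word_error_rate("she had your dark suit", "she had dark suit"))
```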

For a full explanation of the masking range and its acoustic motivation, refer to Section 3 of the thesis linked below.

Evaluation Results on wTIMIT-US (Test Set)

| Setup | Training Data | Augmentation | WER (Normal) | WER (Whispered) |
|---|---|---|---|---|
| No Fine-tuning | Zero-shot | None | 5.0 | 13.7 |
| Baseline | Both modes | None | 5.8 | 11.7 |
| SpecAugment | Both modes | SpecAugment (LD) | 5.2 | 12.3 |
| LF-Mask (Ours) | Both modes | Masking < vowel range | 5.3 (p=0.624) | 11.9 (p=0.384) |

Compared to SpecAugment, LF-Mask showed no statistically significant change in WER for either normal or whispered speech (p=0.624 and p=0.384, respectively).

Cite as

Kokowski, J. (2025). F0-Based Masking Policies for Self-Supervised Whispered Speech Recognition. Master’s Thesis, University of Groningen, Campus Fryslân.
Available at: https://campus-fryslan.studenttheses.ub.rug.nl/id/eprint/674

If you use this model or build upon this work, please cite the thesis above.

Model: Whisper-small
Augmentation: LF-Mask
Evaluation toolkit: SCTK (sclite)
Notes: For complete results, including MAPSSWE and CER scores, refer to Section 5 of the thesis.

Model repository: jankoko/PALF-LF-Whisper-small
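
A minimal usage sketch, assuming the Hugging Face `transformers` library (plus an audio backend such as ffmpeg) is installed and that the checkpoint loads through the standard automatic-speech-recognition pipeline. The helper name `transcribe` and the audio path passed on the command line are hypothetical. The model identifier is the repository name of this model.

```python
import sys

MODEL_ID = "jankoko/PALF-LF-Whisper-small"


def transcribe(audio_path, model_id=MODEL_ID):
    """Transcribe one audio file with the fine-tuned checkpoint."""
    # Imported lazily so that defining this helper does not require
    # transformers to be installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python transcribe.py whispered_sample.wav
    print(transcribe(sys.argv[1]))
```

The first call downloads the model weights, so it needs network access.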