Whisper-medium Singlish2English transcription model

Model overview

This model is a fine-tuned version of openai/whisper-medium, trained on over 2 million speech samples from the Singapore National Speech Corpus (NSC). It focuses on Singaporean-accented English (Singlish), which is typically underrepresented in general-purpose ASR systems.


Custom dataset overview

To enable fine-tuning of open-source foundation ASR models, we curated NSCP16, a bespoke dataset constructed from the NSC corpus. It is designed to capture the range and richness of Singlish across both non-conversational and conversational contexts.

  • Non-conversational speech includes:

    • Part 1: Phonetically-balanced scripts consisting of standard English sentences spoken in local accents.
    • Part 2: Sentences randomly generated from themes such as people, food, places, and brands.
  • Conversational and expressive speech includes:

    • Part 3: Natural dialogues on everyday topics between Singaporean speakers.
    • Part 5: Stylized recordings simulating debates, finance-related discussions, and emotional expressions (both positive and negative).
    • Part 6: Scenario-based dialogues, where speakers engage in topic-driven, semi-scripted interactions covering various themes.

Together, these components make NSCP16 a robust dataset for building speech models that generalize well across local speech styles, tones, and speaking conditions.

Table 1: Overview of the custom-created transcription datasets.

Name Samples Total hours Avg. duration (s) Min (s) Max (s)
NSCP16_train 2,048,000 2944.1 5.2 0.1 30.1
NSCP16_valid 50,000 73.4 5.3 0.8 29.1
NSCP16_test 10,000 19.1 6.9 1.0 26.1
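The statistics in Table 1 follow directly from the per-sample audio durations. As a minimal sketch (the durations below are hypothetical stand-ins; for the real splits they would be read from the NSCP16 audio files):

```python
# Sketch: computing the Table 1 statistics from a list of per-sample
# audio durations in seconds. The input list here is hypothetical.
def duration_stats(durations):
    total_hours = sum(durations) / 3600
    return {
        "samples": len(durations),
        "total_hours": round(total_hours, 1),
        "avg_s": round(sum(durations) / len(durations), 1),
        "min_s": round(min(durations), 1),
        "max_s": round(max(durations), 1),
    }

stats = duration_stats([0.8, 5.3, 12.0, 29.1])
print(stats)  # {'samples': 4, 'total_hours': 0.0, 'avg_s': 11.8, 'min_s': 0.8, 'max_s': 29.1}
```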

Evaluation

Evaluation was conducted on the held-out NSCP16 test set. Performance was measured using Word Error Rate (WER), comparing the fine-tuned model against the off-the-shelf Whisper-medium baseline.

Table 2: Evaluation results on the test dataset using WER. A lower WER indicates better performance (↓).

Model WER (↓)
Whisper-medium (off-the-shelf) 21.09
Whisper-medium-Sing2Eng (fine-tuned) 6.63

This represents a 14.46 percentage point absolute reduction and a 68.5% relative improvement in WER over the baseline Whisper-medium model on the NSCP16 test set.
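WER counts the word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference word count. A minimal self-contained sketch of the metric (the example sentences are illustrative, not drawn from NSCP16):

```python
# Minimal word error rate (WER) via a standard Levenshtein alignment over words:
# WER = (substitutions + deletions + insertions) / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("can lah no problem one", "can la no problem"))  # 2 errors / 5 words = 0.4
```

In practice a library such as jiwer computes the same quantity; the hand-rolled version above just makes the definition behind Table 2 explicit.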

By learning from diverse local accents and speaking styles, this model significantly improves transcription accuracy for Singaporean speech, making it suitable for both research and production applications in multilingual and code-switched environments.

Usage

import torchaudio, torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_name = 'ivabojic/whisper-medium-sing2eng-transcribe'
audio_path = 'path_to_audio'  # e.g., https://github.com/IvaBojic/Singlish2English/blob/main/small_dataset/audios/00862042_713.wav

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

# Load and resample audio if needed
audio, sr = torchaudio.load(audio_path)
if sr != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
    audio = resampler(audio)
audio = audio.squeeze().numpy()

# Preprocess and generate transcription
inputs = processor(audio=audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
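Whisper processes at most 30 seconds of audio per forward pass, and Table 1 shows samples of up to roughly 30 seconds. For longer recordings, one simple approach (a sketch, not part of the released code) is to split the 16 kHz waveform into 30-second windows, transcribe each window with the snippet above, and join the texts:

```python
# Sketch: split a long waveform into fixed 30-second windows at 16 kHz,
# matching Whisper's input window. Each chunk can then be passed through
# the processor/model code above and the transcriptions concatenated.
def chunk_audio(samples, sample_rate=16000, window_s=30):
    """Split a 1-D sequence of samples into fixed-length windows."""
    step = sample_rate * window_s
    return [samples[i:i + step] for i in range(0, len(samples), step)]

chunks = chunk_audio(list(range(16000 * 70)))  # 70 s of dummy samples
print([len(c) // 16000 for c in chunks])  # chunk lengths in seconds: [30, 30, 10]
```

Naive fixed-window chunking can cut words at chunk boundaries; the transformers ASR pipeline with `chunk_length_s` offers overlap-aware chunking if that matters for your use case.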

Project repository

For training scripts, evaluation tools, sample audio files, and more, visit the GitHub repository: https://github.com/IvaBojic/Singlish2English

Model size: 764M parameters (F32, Safetensors)
