Whisper-medium Singlish2English transcription model

Model overview

This model is a fine-tuned version of openai/whisper-medium, trained on over 2 million speech samples from the Singapore National Speech Corpus (NSC). It focuses on Singaporean-accented English (Singlish), which is typically underrepresented in general-purpose ASR systems.


Custom dataset overview

To enable fine-tuning of open-source foundation ASR models, we curated NSCP16, a bespoke dataset constructed from the NSC corpus. It is designed to capture the range and richness of Singlish across both non-conversational and conversational contexts.

  • Non-conversational speech includes:

    • Part 1: Phonetically-balanced scripts consisting of standard English sentences spoken in local accents.
    • Part 2: Sentences randomly generated from themes such as people, food, places, and brands.
  • Conversational and expressive speech includes:

    • Part 3: Natural dialogues on everyday topics between Singaporean speakers.
    • Part 5: Stylized recordings simulating debates, finance-related discussions, and emotional expressions (both positive and negative).
    • Part 6: Scenario-based dialogues, where speakers engage in topic-driven, semi-scripted interactions covering various themes.

Together, these components make NSCP16 a robust dataset for building speech models that generalize well across local speech styles, tones, and speaking conditions.

Table 1: Overview of the custom-created transcription datasets.

Name Samples Total hours Avg. duration (s) Min (s) Max (s)
NSCP16_train 2,048,000 2944.1 5.2 0.1 30.1
NSCP16_valid 50,000 73.4 5.3 0.8 29.1
NSCP16_test 10,000 19.1 6.9 1.0 26.1
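The statistics in Table 1 follow directly from the per-sample audio durations. As a minimal sketch (the durations below are hypothetical stand-ins; for the real splits they would be read from the NSCP16 audio files):

```python
# Sketch: computing the Table 1 statistics from a list of per-sample
# audio durations in seconds. The input list here is hypothetical.
def duration_stats(durations):
    total_hours = sum(durations) / 3600
    return {
        "samples": len(durations),
        "total_hours": round(total_hours, 1),
        "avg_s": round(sum(durations) / len(durations), 1),
        "min_s": round(min(durations), 1),
        "max_s": round(max(durations), 1),
    }

stats = duration_stats([0.8, 5.3, 12.0, 29.1])
print(stats)  # {'samples': 4, 'total_hours': 0.0, 'avg_s': 11.8, 'min_s': 0.8, 'max_s': 29.1}
```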

Evaluation

Evaluation was conducted on the held-out NSCP16 test set. Performance was measured using Word Error Rate (WER), comparing the fine-tuned model against the off-the-shelf Whisper-medium baseline.

Table 2: Evaluation results on the test dataset using WER. A lower WER indicates better performance (↓).

Model WER (↓)
Whisper-medium (off-the-shelf) 21.09
Whisper-medium-Sing2Eng (fine-tuned) 6.63

This represents a 14.46 percentage point absolute reduction and a 68.5% relative improvement in WER over the baseline Whisper-medium model on the NSCP16 test set.
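WER counts the word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference word count. A minimal self-contained sketch of the metric (the example sentences are illustrative, not drawn from NSCP16):

```python
# Minimal word error rate (WER) via a standard Levenshtein alignment over words:
# WER = (substitutions + deletions + insertions) / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("can lah no problem one", "can la no problem"))  # 2 errors / 5 words = 0.4
```

In practice a library such as jiwer computes the same quantity; the hand-rolled version above just makes the definition behind Table 2 explicit.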

By learning from diverse local accents and speaking styles, this model significantly improves transcription accuracy for Singaporean speech, making it suitable for both research and production applications in multilingual and code-switched environments.

Usage

import torchaudio, torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_name = 'ivabojic/whisper-medium-sing2eng-transcribe'
audio_path = 'path_to_audio'  # e.g., https://github.com/IvaBojic/Singlish2English/blob/main/small_dataset/audios/00862042_713.wav

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

# Load and resample audio if needed
audio, sr = torchaudio.load(audio_path)
if sr != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
    audio = resampler(audio)
audio = audio.squeeze().numpy()

# Preprocess and generate transcription
inputs = processor(audio=audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
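Whisper processes at most 30 seconds of audio per forward pass, and Table 1 shows samples of up to roughly 30 seconds. For longer recordings, one simple approach (a sketch, not part of the released code) is to split the 16 kHz waveform into 30-second windows, transcribe each window with the snippet above, and join the texts:

```python
# Sketch: split a long waveform into fixed 30-second windows at 16 kHz,
# matching Whisper's input window. Each chunk can then be passed through
# the processor/model code above and the transcriptions concatenated.
def chunk_audio(samples, sample_rate=16000, window_s=30):
    """Split a 1-D sequence of samples into fixed-length windows."""
    step = sample_rate * window_s
    return [samples[i:i + step] for i in range(0, len(samples), step)]

chunks = chunk_audio(list(range(16000 * 70)))  # 70 s of dummy samples
print([len(c) // 16000 for c in chunks])  # chunk lengths in seconds: [30, 30, 10]
```

Naive fixed-window chunking can cut words at chunk boundaries; the transformers ASR pipeline with `chunk_length_s` offers overlap-aware chunking if that matters for your use case.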

Project repository

For training scripts, evaluation tools, sample audio files, and more, visit the GitHub repository: https://github.com/IvaBojic/Singlish2English

Model size: 764M parameters (F32, Safetensors)
