# Luganda Whisper ASR with a Language Model
This is a fine-tuned Whisper-small model for Luganda ASR, trained on the Common Voice and FLEURS datasets and enhanced with a 5-gram KenLM language model for improved transcription quality.
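If you want to inspect the 5-gram model on its own, it can be loaded with the `kenlm` Python bindings. This is an optional sanity check, not part of the decoding pipeline; `5gram.bin` is the same file used in the usage example below, and the sentence being scored is just a placeholder.

```python
import kenlm

# Load the binary 5-gram model shipped with this repository.
lm = kenlm.Model("5gram.bin")

# Score a (placeholder) space-separated sentence.
# KenLM returns a total log10 probability; bos/eos add sentence-boundary tokens.
print(lm.score("webale nnyo", bos=True, eos=True))
```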
> ⚠️ **Note:** OpenAI's Whisper does not officially support `"lg"` (Luganda) as a recognized language code. To bypass this tokenizer restriction, we use `"sw"` (Swahili) as a placeholder. This workaround does not affect the model's ability to transcribe Luganda, since both the model and the language model are fine-tuned specifically for Luganda; it is only needed to satisfy Whisper's internal tokenizer constraints.
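You can confirm the restriction yourself: the openai-whisper package keeps its supported language codes in the `whisper.tokenizer.LANGUAGES` table, and `"lg"` is not among them (a quick check, assuming a standard `openai-whisper` install):

```python
import whisper.tokenizer

# "lg" (Luganda) is absent from Whisper's built-in language table,
# while "sw" (Swahili) is present -- hence the placeholder code.
print("lg" in whisper.tokenizer.LANGUAGES)  # False
print("sw" in whisper.tokenizer.LANGUAGES)  # True ("swahili")
```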
## Usage (with whisper-lm)
```bash
git clone https://huggingface.co/sulaimank/whisper-small-lg-lm
cd whisper-small-lg-lm
pip install -r requirements.txt
```
```python
import whisper
from whisper_decoder_with_lm import LMOptions

model_path = "whisper-small-lg.pt"
lm_path = "5gram.bin"

# Set LM parameters.
# alpha and beta were tuned to minimize WER on a held-out subset
# (here: 2,000 samples from Common Voice).
LMOptions().lm_path = lm_path
LMOptions().lm_alpha = 0.0211
LMOptions().lm_beta = 0.0119

# Whisper decode options
decode_options = {
    "language": "sw",  # Swahili tokenizer as a workaround for Luganda (see note above)
    "without_timestamps": True,
    "temperature": 0.0,
    "beam_size": 5,
}

# Transcribe audio
model = whisper.load_model(model_path)
result = model.transcribe("your_audio.wav", **decode_options)
print("Transcription:", result["text"])
```