MediBeng Whisper Tiny

Model Description

MediBeng Whisper Tiny is a fine-tuned version of the Whisper Tiny model for automatic speech recognition (ASR), designed to transcribe and translate code-switched Bengali-English conversations into English. It was fine-tuned for clinical settings and handles audio that mixes Bengali and English, making it suitable for transcription and translation in multilingual environments such as medical and healthcare settings.

Usage

To use the MediBeng Whisper Tiny model to translate code-switched Bengali-English conversations into English, follow the example below.

First, install the required packages:

pip install torch transformers librosa

Then run this code:

import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Set the model path and language/task
model_path = "pr0mila-gh0sh/MediBeng-Whisper-Tiny"
LANGUAGE = "en"  # Target language for translation
TASK = "translate"  # Translation task

# Load model and processor from the specified path
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)

# Get forced decoder IDs for translation task to English
forced_decoder_ids = processor.get_decoder_prompt_ids(language=LANGUAGE, task=TASK)

# Path to your single audio file
audio_file_path = "path_to_audio.wav"

# Load and preprocess the audio file using librosa
audio_input, _ = librosa.load(audio_file_path, sr=16000)

# Process the audio sample into input features for the Whisper model
input_features = processor(audio_input, sampling_rate=16000, return_tensors="pt").input_features

# Generate token ids for the transcription/translation
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

# Decode token ids to text (translation)
translation = processor.batch_decode(predicted_ids, skip_special_tokens=True)

# Output the transcription/translation result
print("Translation:", translation[0])

Key Features:

  • Speech-to-text: Converts code-mixed Bengali-English audio to English text.
  • Clinical Setting: Fine-tuned on a medical dataset containing clinical conversations, enabling it to handle healthcare-specific terminology.
  • Code-mixed Speech: Designed to handle code-switching between Bengali and English, which is common in multilingual regions.

Intended Use

This model is intended for use by researchers and developers working with code-mixed Bengali-English audio in the clinical domain. It is suitable for:

  • Medical transcription services where conversations involve both Bengali and English.
  • Voice assistants in healthcare, assisting healthcare providers in multilingual settings.
  • Speech-to-text applications in healthcare environments, particularly for doctors and patients speaking a mix of Bengali and English.

The model works best in environments where both Bengali and English are used interchangeably, particularly in healthcare or clinical scenarios.

Training Data

The model was fine-tuned on the MediBeng dataset, which consists of code-switched Bengali-English conversations in clinical settings.

  • Dataset Size: 20% of the MediBeng dataset was used for fine-tuning. The dataset is available on Hugging Face.
  • Data Source: MediBeng dataset
  • Data Process Source: ParquetToHuggingFace
  • Data Characteristics: The dataset contains conversational speech with both Bengali and English, with specific focus on medical terminologies and clinical dialogues.
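
To explore the data itself, the dataset can be loaded with the Hugging Face datasets library. A minimal sketch is shown below; the repository ID and split name are assumptions, so replace them with the actual MediBeng dataset ID and split names listed on its Hugging Face page.

from datasets import load_dataset

# Assumed repository ID: replace with the actual MediBeng dataset ID on Hugging Face
dataset = load_dataset("pr0mila-gh0sh/MediBeng")

# Inspect the available splits and the first example
# (field names depend on the dataset's own schema)
print(dataset)
print(dataset["train"][0])  # assumes a "train" split exists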

Evaluation Results

The model's performance improved as training progressed, with a steady reduction in training loss and an overall reduction in Word Error Rate (WER) on the evaluation set.

Epoch   Training Loss   Training Grad Norm   Learning Rate   Eval Loss   Eval WER
0.03    2.6213          61.56                4.80E-06        -           -
0.07    1.609           44.09                9.80E-06        1.13        107.72
0.1     0.7685          52.27                9.47E-06        -           -
0.13    0.4145          32.27                8.91E-06        0.37        47.53
0.16    0.3177          17.98                8.36E-06        -           -
0.2     0.222           7.7                  7.80E-06        0.1         45.19
0.23    0.0915          1.62                 7.24E-06        -           -
0.26    0.081           0.4                  6.69E-06        0.04        38.35
0.33    0.0246          1.01                 5.58E-06        -           -
0.36    0.0212          2.2                  5.02E-06        0.01        41.88
0.42    0.0052          0.13                 3.91E-06        -           -
0.46    0.0023          0.45                 3.36E-06        0.01        34.07
0.52    0.0013          0.05                 1.69E-06        -           -
0.55    0.0032          0.11                 1.13E-06        0.01        29.52
0.62    0.001           0.09                 5.78E-07        -           -
0.65    0.0012          0.08                 2.22E-08        0           30.49

  • Training Loss: The training loss decreases steadily, indicating the model is fitting the training data well.
  • Eval Loss: The evaluation loss drops from 1.13 to near zero, showing that the model generalizes to unseen data.
  • Eval WER: The Word Error Rate (WER) falls from 107.72 to around 30 over the run, with small fluctuations between checkpoints, indicating steadily improving translation of code-switched Bengali-English speech.
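
For reference, Word Error Rate is the edit distance between predicted and reference texts, normalized by the number of reference words. The sketch below shows one way to compute it with the Hugging Face evaluate library (which relies on jiwer); the example strings are illustrative only and are not taken from the dataset.

import evaluate

# Load the WER metric (requires: pip install evaluate jiwer)
wer_metric = evaluate.load("wer")

# Illustrative predictions and reference translations (not from MediBeng)
predictions = ["the patient reports chest pain since morning"]
references = ["the patient reports chest pain since this morning"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")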

Limitations

  • Accents: The model may struggle with very strong regional accents or non-native speakers of Bengali and English.
  • Specialized Terms: The model may not perform well with highly specialized medical terms or out-of-domain speech.
  • Multilingual Support: While the model is designed for Bengali and English, other languages are not supported.

Ethical Considerations

  • Biases: The training data may contain biases based on the demographics of the speakers, such as gender, age, and accent.
  • Misuse: Like any ASR system, this model could be misused to create fake transcripts of audio recordings, potentially leading to privacy and security concerns.
  • Fairness: Ensure the model is used in contexts where fairness and ethical considerations are taken into account, particularly in clinical environments.

Blog Post

I’ve written a detailed blog post on Medium about MediBeng Whisper-Tiny and how it translates code-switched Bengali-English speech in healthcare. In this post, I discuss the dataset creation, model fine-tuning, and how this can improve healthcare transcription. Read the full article here: MediBeng Whisper-Tiny: Translating Code-Switched Bengali-English Speech for Healthcare

Citation for Research Use

If you use MediBeng Whisper-Tiny or the MediBeng dataset in your research or project, please cite the following:

For the MediBeng Whisper-Tiny model (fine-tuned model):

@misc{pr0mila2025medibengwhisper,
  author = {Promila Ghosh},
  title = {Medibeng Whisper-Tiny: Code-Switched Bengali-English Speech Translation for Clinical Settings},
  year = {2025},
  howpublished = {\url{https://huggingface.co/pr0mila-gh0sh/MediBeng-Whisper-Tiny}},
}