Swahili-English Speech-to-Text (STT) Model
This model is a fine-tuned version of openai/whisper-medium specifically optimized for Swahili and English speech recognition. It was trained on the Common Voice 17.0 dataset and achieves a significantly lower word error rate (WER) than the base model.
Model Performance
The model achieves the following results on the evaluation set:
- Loss: 0.3390
- WER: 14.7
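The reported WER is the word error rate, in percent, on the held-out evaluation split. As a point of reference (and not necessarily the exact script used to produce the numbers above), WER between reference and predicted transcripts can be computed with the `jiwer` package:

```python
# Illustrative WER computation with jiwer (pip install jiwer);
# the sentences below are placeholders, not the evaluation data.
import jiwer

references = ["nchi ya maajabu", "mji ulianzishwa kwenye kisiwa"]
hypotheses = ["nchi ya maajabu", "mji ulianzishwa kwenye kisiwa kilichopo"]

wer = jiwer.wer(references, hypotheses)  # returns a fraction in [0, 1]
print(f"WER: {wer * 100:.2f}%")
```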
Usage
Installation
First, install the required dependencies:
```bash
pip install transformers torch librosa
```
Basic Usage
```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

# Load the model and processor
processor = AutoProcessor.from_pretrained("Jacaranda-Health/ASR-STT")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Jacaranda-Health/ASR-STT")
model.generation_config.forced_decoder_ids = None


def transcribe(filepath):
    """
    Transcribe an audio file to text.

    Args:
        filepath (str): Path to the audio file

    Returns:
        str: Transcribed text
    """
    # Load the audio file and resample to 16 kHz
    audio, sr = librosa.load(filepath, sr=16000)

    # Extract input features
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

    # Generate the transcription
    with torch.no_grad():
        generated_ids = model.generate(inputs["input_features"])

    # Decode the transcription
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return transcription


# Example usage
transcription = transcribe("path/to/your/audio.wav")
print(f"Transcription: {transcription}")
```
Batch Processing
```python
def transcribe_batch(audio_files):
    """
    Transcribe multiple audio files.

    Args:
        audio_files (list): List of audio file paths

    Returns:
        list: List of per-file results (transcription or error)
    """
    transcriptions = []
    for filepath in audio_files:
        try:
            transcription = transcribe(filepath)
            transcriptions.append({
                'file': filepath,
                'transcription': transcription
            })
        except Exception as e:
            transcriptions.append({
                'file': filepath,
                'error': str(e)
            })
    return transcriptions


# Example usage
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = transcribe_batch(audio_files)
```
Model Comparison
The fine-tuned model shows dramatic improvements over the base Whisper model, particularly in Swahili accuracy. The examples below show cases where the base model failed outright while the fine-tuned model produced accurate transcriptions:
Example 1: Complete Language Confusion
- Ground Truth: "Panya wengi huishi kati ya wanadamu."
- Base Model: "本来我以为是个铁网来的" (Chinese characters)
- Fine-tuned Model: "Wanyawengi huishi kati ya wanadamu." ✓

- Ground Truth: "Mji ulianzishwa kwenye kisiwa kilichopo karibu sana na bara."
- Base Model: "Nguni unia nzisho kwenye kisiwa kilichopo kariwu sana nabara"
- Fine-tuned Model: "Mji ulianzishwa kwenye kisiwa kilichopo karibu sana na bara." ✓

- Ground Truth: "Nchi ya maajabu."
- Base Model: "Um dia mais, diabo!" (Portuguese/Spanish)
- Fine-tuned Model: "Nchi ya maajabu." ✓
Example 2: Arabic Script Mix
- Ground Truth: "Alama yake ni µm."
- Base Model: "الله معاكي لأم" (Arabic script)
- Fine-tuned Model: "Alama yake ni µm." ✓
Example 3: English Instead of Swahili
- Ground Truth: "Ni msimamizi wa mtandao na wa wanafunzi."
- Base Model: "You don't see no music on Tyndale? No, I don't see no music on Tyndale."
- Fine-tuned Model: "Ni msimamizi wa mtandao na wa wanafunzi." ✓
Key Improvements
The fine-tuned model demonstrates superior performance in:
- Swahili Grammar: Better handling of Swahili sentence structure and grammar
- Word Recognition: More accurate recognition of Swahili vocabulary
- Context Understanding: Improved contextual understanding across different domains
- Pronunciation Variants: Better handling of different Swahili pronunciation patterns
- Mixed Language: Enhanced performance on code-switched Swahili-English content
Training Visualizations
The following charts illustrate the model's training progress and performance improvements:
Word Error Rate (WER) Progress
The WER chart shows the steady improvement in transcription accuracy throughout training. Starting from approximately 21.6% WER at step 500, the model improves to about 14.7% WER by step 8000, with its lowest value of roughly 14.6% reached around step 5500, demonstrating consistent learning and convergence.
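If you want to recreate this curve yourself, the values in the training results table below are enough; a minimal matplotlib sketch (with WER values rounded from that table) could look like this:

```python
import matplotlib.pyplot as plt

# Steps and validation WER (%) copied from the training results table below.
steps = [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000,
         4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000]
wer = [21.63, 18.36, 16.48, 16.90, 15.92, 15.86, 15.74, 15.26,
       15.58, 14.91, 14.63, 14.64, 14.66, 14.70, 14.71, 14.71]

plt.plot(steps, wer, marker="o")
plt.xlabel("Training step")
plt.ylabel("WER (%)")
plt.title("Validation WER during fine-tuning")
plt.grid(True)
plt.show()
```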
Learning Rate Schedule
The learning rate follows a cosine annealing schedule, starting at 1e-05 and gradually decreasing over the 8000 training steps. This schedule helps ensure stable training and prevents overfitting while allowing the model to fine-tune effectively.
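For illustration, the same warmup-plus-cosine schedule can be constructed with `get_cosine_schedule_with_warmup` from `transformers`. This is a sketch of the schedule only (reusing the `model` object loaded above), not the original training script:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Optimizer and scheduler matching the reported hyperparameters (illustrative).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5,
                              betas=(0.9, 0.999), eps=1e-8)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=50,      # warmup steps from the training config below
    num_training_steps=8000,  # total fine-tuning steps
)

# During training, step the scheduler after each optimizer update:
# optimizer.step(); scheduler.step()
```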
Training Details
Training Procedure
The model was fine-tuned using the following approach:
- Base Model: OpenAI Whisper Medium
- Dataset: Mozilla Common Voice 17.0 (Swahili and English)
- Training Steps: 8,000 steps
- Learning Rate: 1e-05 with cosine scheduler
- Batch Size: 16 (train and eval)
Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 50
- training_steps: 8000
- mixed_precision_training: Native AMP
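Expressed as Hugging Face `Seq2SeqTrainingArguments`, these settings would look roughly like the sketch below. This is a reconstruction from the list above, not the original training script; `output_dir` is a placeholder, and the 500-step evaluation interval is inferred from the results table below:

```python
from transformers import Seq2SeqTrainingArguments

# Reconstruction of the reported hyperparameters (illustrative only).
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-sw-en",  # placeholder path
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    max_steps=8000,
    seed=42,
    fp16=True,                 # mixed-precision training (native AMP)
    eval_strategy="steps",
    eval_steps=500,            # inferred from the results table below
    predict_with_generate=True,
)
```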
Training Results
Training Loss | Epoch | Step | Validation Loss | WER Ortho | WER |
---|---|---|---|---|---|
0.4135 | 0.6180 | 500 | 0.4069 | 29.9115 | 21.6319 |
0.2036 | 1.2361 | 1000 | 0.3584 | 25.8738 | 18.3552 |
0.1899 | 1.8541 | 1500 | 0.3390 | 24.0940 | 16.4814 |
0.0978 | 2.4722 | 2000 | 0.3406 | 24.1957 | 16.8982 |
0.0584 | 3.0902 | 2500 | 0.3589 | 22.7718 | 15.9189 |
0.0457 | 3.7083 | 3000 | 0.3660 | 23.3075 | 15.8580 |
0.0203 | 4.3263 | 3500 | 0.3762 | 22.9108 | 15.7394 |
0.0193 | 4.9444 | 4000 | 0.3683 | 22.0192 | 15.2616 |
0.0073 | 5.5624 | 4500 | 0.3926 | 22.5447 | 15.5801 |
0.0022 | 6.1805 | 5000 | 0.4065 | 21.5649 | 14.9092 |
0.0022 | 6.7985 | 5500 | 0.4080 | 21.2835 | 14.6313 |
0.0009 | 7.4166 | 6000 | 0.4180 | 21.2564 | 14.6415 |
0.0007 | 8.0346 | 6500 | 0.4244 | 21.2361 | 14.6551 |
0.0006 | 8.6527 | 7000 | 0.4283 | 21.3276 | 14.6957 |
0.0006 | 9.2707 | 7500 | 0.4297 | 21.3378 | 14.7059 |
0.0006 | 9.8888 | 8000 | 0.4300 | 21.3276 | 14.7093 |
Supported Languages
- Primary: Swahili (sw)
- Secondary: English (en)
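By default the usage example above clears `forced_decoder_ids`, so decoding is language-agnostic. If you know a clip's language in advance, you can optionally hint it at generation time; this assumes the checkpoint retains Whisper's multilingual language tokens and is not required for normal use:

```python
# Optional: hint the expected language ("sw" or "en") when generating.
audio, sr = librosa.load("path/to/swahili_clip.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

generated_ids = model.generate(
    inputs["input_features"], language="sw", task="transcribe"
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```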
Out-of-Scope Use
This Speech-to-Text (ASR) model is intended for research, social good, and internal use only. For commercial use and distribution, organizations and individuals are encouraged to contact Jacaranda Health. To ensure ethical and responsible use of this ASR model, we have outlined a set of guidelines that group activities and practices into three areas: prohibited actions, high-risk activities, and deceptive practices. By understanding and adhering to these directives, users can contribute to a safer and more trustworthy environment.
1. Prohibited Actions:
- Illegal Activities: Avoid using the model to transcribe content that promotes violence, child exploitation, human trafficking, and other crimes.
- Harassment and Discrimination: No transcription activities that facilitate bullying, threats, or discriminatory practices.
- Unauthorized Surveillance: No unlicensed monitoring or recording of individuals without proper consent.
- Data Misuse: Handle audio data and transcriptions with proper consents and privacy protections.
- Rights Violations: Respect third-party intellectual property and privacy rights in audio content.
- Malicious Content Creation: Avoid transcribing content intended for harmful software or malicious purposes.
2. High-Risk Activities:
- Sensitive Industries: Exercise extreme caution when using in military, nuclear, or intelligence domains.
- Legal Proceedings: Avoid usage as sole evidence in critical legal or judicial processes without proper validation.
- Critical Systems: No deployment in safety-critical infrastructure or transport technologies without extensive testing.
- Medical Diagnosis: Avoid using transcriptions for direct medical diagnosis or treatment decisions.
- Emergency Services: Not suitable as primary tool for emergency response systems.
3. Deceptive Practices:
- Misinformation: Refrain from using transcriptions to create or promote fraudulent or misleading audio content.
- Deepfake Audio: Avoid using transcriptions to facilitate creation of deceptive synthetic audio.
- Impersonation: No transcribing content intended to impersonate individuals without authorization.
- Misrepresentation: No false claims about transcription accuracy or model capabilities.
- Fake Content Generation: No promotion of false audio-text pairs or fabricated conversations.
Bias, Risks, and Limitations
This Speech-to-Text model is a capable system with significant potential, but it carries inherent risks and limitations. Testing has focused predominantly on Swahili and English, so many linguistic variations and acoustic scenarios remain untested.
Key Limitations:
Language and Dialect Variations: The model's performance may vary significantly across different Swahili dialects, regional accents, and code-switching patterns not represented in the training data.
Audio Quality Sensitivity: Performance degrades with poor audio quality, background noise, multiple speakers, or non-standard recording conditions.
Domain Specificity: The model may struggle with highly technical terminology, proper names, or domain-specific vocabulary outside its training scope.
Contextual Understanding: While improved over the base model, contextual interpretation limitations may lead to incorrect transcriptions in ambiguous scenarios.
Bias Considerations: Like other AI models, this ASR system may exhibit biases present in the training data, potentially affecting transcription quality for underrepresented speaker groups or topics.
Responsible Deployment:
Like other ASR systems, this model's output is not fully predictable, and it may occasionally produce transcriptions that are inaccurate, culturally insensitive, or otherwise problematic for certain audio inputs.
Before deploying this ASR model in any production application, developers must conduct thorough safety testing and evaluation tailored to their specific use case, including testing across diverse speaker demographics, audio conditions, and content types relevant to the intended application.
Contact Us
For any questions, feedback, or commercial inquiries, please reach out at [email protected]
Framework Versions
- Transformers 4.51.3
- PyTorch 2.5.1+cu121
- Datasets 3.6.0
- Tokenizers 0.21.1
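To reproduce this environment approximately, you can pin the listed versions (a convenience sketch; GPU builds of PyTorch may require the appropriate CUDA index URL):

```bash
pip install transformers==4.51.3 torch==2.5.1 datasets==3.6.0 tokenizers==0.21.1
```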
Citation
If you use this model in your research, please cite:
```bibtex
@misc{jacaranda_asr_stt_2025,
  title={Swahili-English Speech-to-Text Model},
  author={Jacaranda Health},
  year={2025},
  howpublished={\url{https://huggingface.co/Jacaranda-Health/ASR-STT}}
}
```