whisper-swedish-telephonic
Model Overview
whisper-swedish-telephonic is a fine-tuned version of OpenAI's Whisper-Small model, designed for transcribing Swedish telephonic audio. The model is optimized for low-bandwidth, multi-speaker conversations such as call center interactions.
Key Features:
- Language: Swedish (primary), with limited support for minor English segments.
- Audio Types: Telephonic conversations, customer support recordings, and general low-bandwidth audio.
- Sample Rate: 8 kHz source audio, resampled to 16 kHz before it is passed to the model.
- Special Tokens: Supports conversational markers, disfluencies, and speaker-specific tags.
- Performance: Reaches a WER of 0.170 on the Swedish telephonic test set, compared with 0.888 for the base Whisper-Small model.
Dataset
The model was fine-tuned using the Swedish Telephonic Dataset, consisting of:
- Duration: ~97 hours of annotated audio.
- Domains: Call center recordings, customer service conversations.
- Annotations:
  - Speaker IDs and timestamps.
  - Conversational tags: `(())`, `~`, `<overlap>`.
  - Language switching: `<lang:English>...</lang:English>`.
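The language-switching markup above can be handled with a small regular expression. The following is a minimal sketch, not the tooling used to build the dataset: the function names and the sample sentence are hypothetical, and it assumes the tags are well formed and not nested.

```python
import re

# Matches <lang:X>...</lang:X>; assumes well-formed, non-nested tags.
LANG_SPAN = re.compile(r"<lang:([A-Za-z]+)>(.*?)</lang:\1>", re.DOTALL)


def extract_language_spans(text):
    """Return (language, segment) pairs for every tagged span in the text."""
    return [(m.group(1), m.group(2).strip()) for m in LANG_SPAN.finditer(text)]


def strip_language_tags(text):
    """Remove the tags but keep the words they enclose."""
    return LANG_SPAN.sub(lambda m: m.group(2), text)


# Hypothetical annotated line, for illustration only.
line = "ja precis <lang:English>no problem</lang:English> tack"
print(extract_language_spans(line))
print(strip_language_tags(line))
```

Stripping the tags this way gives plain text for scoring, while the extracted spans can be used to measure how much English a transcript contains.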
Preprocessing:
- Audio: Resampled to 16kHz.
- Segmentations: Aligned with timestamps.
- Special Tokens: Includes non-speech sounds like `[cough]`, `[laugh]`.
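To make the resampling step concrete, here is a minimal sketch that upsamples 8 kHz samples to 16 kHz by linear interpolation. This is an illustrative stand-in, not the resampler used to prepare the dataset (which the card does not specify); the function name is hypothetical, and a production pipeline would normally use a band-limited resampler from an audio library instead.

```python
def upsample_linear(samples, src_rate=8000, dst_rate=16000):
    """Resample a list of float samples by linear interpolation.

    Intended for upsampling only: no anti-aliasing filter is applied,
    so it is not suitable for reducing the sample rate.
    """
    if not samples:
        return []
    n_out = int(round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # position in the source signal
        lo = int(pos)                        # sample at or before pos
        hi = min(lo + 1, len(samples) - 1)   # next sample, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# Three 8 kHz samples become six 16 kHz samples.
print(upsample_linear([0.0, 1.0, 0.0]))
```

Doubling the rate this way adds no information above 4 kHz; it only brings the signal to the rate the Whisper feature extractor expects.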
Model Performance
Word Error Rate (WER) Evaluation
The fine-tuned model was benchmarked against OpenAI's base Whisper-Small model using a Swedish telephonic test dataset containing 207 labeled speech segments.
Metric | Fine-Tuned Model | Base Whisper-Small |
---|---|---|
WER | 0.170 | 0.888 |
Key Observations:
- Fine-Tuned Model:
- Excellent transcription accuracy for colloquial Swedish, domain-specific terminology, and long utterances.
- Handles speaker-specific annotations and conversational markers effectively.
- Base Model:
- Struggles with Swedish syntax and domain-specific vocabulary.
- Outputs nonsensical transcriptions for longer or complex sentences.
Example Transcriptions
Segment | Ground Truth | Fine-Tuned Model | Base Model | WER (Fine-Tuned) | WER (Base) |
---|---|---|---|---|---|
1 | så nu | så nu | so, no | 0.000 | 1.000 |
2 | nu record du båda va | nu record du båda va | nu rekordar du båda | 0.000 | 0.400 |
3 | ja jag kommer inte ihåg | ja jag kommer inte ihåg | i am coming to you | 0.000 | 1.000 |
5 | sen när då, sen alltid... inga gäster | sen när då, sen alltid... inga gäster | sen då, sen alltid... ingen gest | 0.000 | 0.250 |
14 | till frankrike | till frankrike | thank you | 0.000 | 1.000 |
Note: Full segment-wise evaluation logs are available in the repository.
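For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below computes it for a single segment and for a whole set of segments. It assumes plain whitespace tokenization with no text normalization; the card does not state how the reported figures were normalized or aggregated, so this is not necessarily the evaluation script behind them.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER for one segment: edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def corpus_wer(pairs):
    """WER over many segments: total errors / total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


# Base-model outputs for segments 2 and 14 from the table above.
print(wer("nu record du båda va", "nu rekordar du båda"))
print(wer("till frankrike", "thank you"))
```

Corpus-level WER weights long segments more heavily than a plain average of per-segment scores, so the two aggregates can differ on the same data.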
Audio Example
This audio file demonstrates the model's transcription abilities:
- File: trimmed_resampled_audio.wav
- Content: Hej du har kommit till Dressmann. Du pratar med Isabelle. Vad kan jag hjälpa dig?
- Audio Type: Telephonic conversation.
- Sample Rate: 16kHz (resampled).
- Purpose: Showcasing the model's capabilities in transcribing Swedish telephonic speech.
Intended Use
This model is designed for:
- Customer Support Automation: Transcription and analysis of call center recordings.
- Telephony Analytics: Sentiment analysis, compliance monitoring, and business intelligence.
- Swedish Language Research: Study of conversational patterns and colloquial expressions.
Limitations:
- Language Support: Primarily Swedish; limited support for English.
- Audio Quality: Optimized for telephonic audio; performance may degrade with studio-quality or highly noisy audio.
- Preprocessing Requirement: Input audio must be resampled to 16 kHz before it is passed to the processor; this includes native 8 kHz telephone recordings.
Try the Model
You can test the model using the Hugging Face Playground or the dedicated endpoint:
- Playground: Test the Model
- Dedicated Endpoint: Endpoint URL
How to Use
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa
# Load model and processor
model_name = "WMRNORDIC/whisper-swedish-telephonic"
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)
# Load audio as mono and resample to 16 kHz, the rate the Whisper processor expects
# (passing 8 kHz audio directly makes the feature extractor raise an error)
audio, sample_rate = librosa.load("path_to_audio.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")
# Transcribe
generated_ids = model.generate(inputs.input_features)
# Decode token IDs to text, dropping Whisper's control tokens
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)