Whisper Large V3 Turbo - Swiss German Fine-tuned

This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, adapted for Swiss German (Schweizerdeutsch) automatic speech recognition. The model transcribes Swiss German speech to Standard German text. Note: formal evaluation of this model is still pending.

Model Description

  • Base Model: openai/whisper-large-v3-turbo
  • Language(s): Swiss German dialects → Standard German text
  • Model Size: 809M parameters
  • License: Apache 2.0
  • Finetuned from: openai/whisper-large-v3-turbo

Performance

Formal evaluation is still in progress; the placeholders below will be filled in once benchmark results are available (a sketch of how to compute the metrics follows this list):

  • Word Error Rate (WER): pending evaluation
  • Character Error Rate (CER): pending evaluation
  • Training Data: 350+ hours of Swiss German speech
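
WER and CER can be computed with the 🤗 evaluate library. A minimal sketch, assuming predictions from a held-out Swiss German test set (the example strings are placeholders, not real benchmark data):

import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# predictions: model outputs on the test set;
# references: the corresponding Standard German transcripts
predictions = ["ich gehe morgen nach zürich"]
references = ["ich gehe morgen nach zürich"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))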

Training Data

This model was fine-tuned on a comprehensive dataset of Swiss German speech, including:

  • SwissDial-Zh v1.1: 24 hours of balanced Swiss German dialects
  • Swiss Parliament Corpus V2 (SPC): 293 hours of parliamentary speech data
  • All Swiss German Dialects Test Set: 13 hours with representative dialect distribution
  • ArchiMob Release 2: 70 hours

Total training data: 350+ hours of high-quality Swiss German speech with Standard German transcriptions.

Supported Dialects

The model supports all major Swiss German dialects:

  • Aargau (AG)
  • Bern (BE)
  • Basel (BS)
  • Graubünden (GR)
  • Lucerne (LU)
  • St. Gallen (SG)
  • Valais (VS)
  • Zurich (ZH)

Usage

Quick Start with Pipeline

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Flurin17/whisper-large-v3-turbo-swiss-german"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True, 
    use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a Swiss German audio file
result = pipe("path/to/swiss_german_audio.wav")
print(result["text"])
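
Whisper operates on 30-second windows. For recordings longer than that, enable chunked long-form inference with the pipeline's chunk_length_s argument (the file name below is a placeholder):

# Chunked long-form inference for audio longer than 30 seconds
result = pipe("long_swiss_german_audio.wav", chunk_length_s=30, batch_size=8)
print(result["text"])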

Batch Processing

# Process multiple files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(audio_files, batch_size=8)

for result in results:
    print(result["text"])

With Timestamps

# Get word-level timestamps
result = pipe("swiss_german_audio.wav", return_timestamps="word")
print(result["chunks"])

# Get segment-level timestamps
result = pipe("swiss_german_audio.wav", return_timestamps=True)
print(result["chunks"])

Advanced Usage with Model + Processor

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Flurin17/whisper-large-v3-turbo-swiss-german"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Load and preprocess audio
audio_array, sampling_rate = librosa.load("swiss_german_audio.wav", sr=16000)

inputs = processor(
    audio_array,
    sampling_rate=sampling_rate,
    return_tensors="pt"
)
inputs = inputs.to(device, dtype=torch_dtype)

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(**inputs)

# Decode the transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
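
Whisper's generate also accepts language and task arguments. If decoding ever drifts to the wrong language, you can pin it explicitly; this is optional, as the fine-tuned checkpoint targets Standard German output by design:

# Optionally force German transcription during generation
with torch.no_grad():
    predicted_ids = model.generate(**inputs, language="german", task="transcribe")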

Training Details

Training Hyperparameters

  • Learning Rate: 2e-5
  • Batch Size: 24 per device (train), 4 per device (eval)
  • Gradient Accumulation Steps: 2
  • Epochs: 3
  • Weight Decay: 0.005
  • Warmup Ratio: 0.03
  • Precision: bfloat16
  • Optimizer: AdamW
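
For reference, a minimal sketch of how these hyperparameters map onto 🤗 Seq2SeqTrainingArguments; the output directory and the evaluation-related settings are illustrative assumptions, not the exact original configuration:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-swiss-german",  # assumed path
    learning_rate=2e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    weight_decay=0.005,
    warmup_ratio=0.03,
    bf16=True,                    # bfloat16 precision
    optim="adamw_torch",          # AdamW optimizer
    predict_with_generate=True,   # decode with generate() during evaluation
)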

Training Infrastructure

  • Hardware: 4x NVIDIA A100 GPUs (80GB each)
  • Compute: Azure Machine Learning
  • Training Time: ~5 hours
  • Framework: 🤗 Transformers, PyTorch

Data Processing

The training data was processed with the following pipeline:

  • Audio resampling to 16kHz
  • Log-Mel spectrogram feature extraction (128 Mel bins)
  • Text normalization and tokenization
  • Dynamic batching with sequence length grouping
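
A minimal sketch of these preprocessing steps for a single example; the field names ("audio_path", "text") and the use of librosa for resampling are assumptions:

import librosa
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")

def prepare_example(example):
    # Resample the audio to 16 kHz
    audio, sr = librosa.load(example["audio_path"], sr=16000)
    # Extract log-Mel spectrogram features (128 Mel bins for large-v3 models)
    example["input_features"] = processor(
        audio, sampling_rate=sr, return_tensors="np"
    ).input_features[0]
    # Tokenize the Standard German transcription into label ids
    example["labels"] = processor.tokenizer(example["text"]).input_ids
    return example

Dynamic batching with sequence length grouping roughly corresponds to the group_by_length option of the 🤗 Trainer.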

Comparison with Other Models

| Model | WER | CER | Parameters |
|---|---|---|---|
| whisper-large-v3-turbo-swiss-german | pending | pending | 809M |
| whisper-large-v3-turbo (zero-shot) | pending | pending | 809M |

Limitations and Bias

  • Domain: Primarily trained on read speech and parliamentary proceedings
  • Dialects: Performance may vary across different Swiss German dialects
  • Audio Quality: Best performance on clean, high-quality audio recordings
  • Speaker Demographics: Training data may not be fully representative of all speaker groups
  • Transcription Style: Outputs Standard German text, not dialectal transcriptions

Model Card Authors

  • Flurin17 - Model development and fine-tuning

Citation

If you use this model in your research, please cite:

@misc{whisper-large-v3-turbo-swiss-german-2025,
  author = {Flurin17},
  title = {Whisper Large V3 Turbo Fine-tuned for Swiss German},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Flurin17/whisper-large-v3-turbo-swiss-german}
}

Also consider citing the original Whisper paper:

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

And the Swiss German datasets used for training:

@article{dogan2021swissdial,
  title={SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German},
  author={Dogan-Schönberger, Pelin and Mäder, Julian and Hofmann, Thomas},
  journal={arXiv preprint arXiv:2103.11401},
  year={2021}
}

@inproceedings{samardzic2016archimob,
  title={ArchiMob - A Corpus of Spoken Swiss German},
  author={Samardžić, Tanja and Scherrer, Yves and Glaser, Elvira},
  booktitle={Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  pages={4061--4066},
  year={2016},
  url={https://aclanthology.org/L16-1641}
}

@article{scherrer2019digitising,
  title={Digitising Swiss German: how to process and study a polycentric spoken language},
  author={Scherrer, Yves and Samardžić, Tanja and Glaser, Elvira},
  journal={Language Resources and Evaluation},
  volume={53},
  pages={735--769},
  year={2019},
  doi={10.1007/s10579-019-09457-5}
}

@inproceedings{pluss2022sds200,
  title={SDS-200: A Swiss German speech to standard German text corpus},
  author={Plüss, Michel and Hürlimann, Manuela and Cuny, Marc and Stöckli, Alla and Kapotis, Nikolaos and Hartmann, Julia and Ulasik, Malgorzata Anna and Scheller, Christian and Schraner, Yanick and Jain, Amit and Deriu, Jan and Cieliebak, Mark and Vogel, Manfred},
  booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference},
  pages={3250--3256},
  year={2022},
  address={Marseille, France},
  publisher={European Language Resources Association}
}

@article{pluss2020spc,
  title={Swiss Parliaments Corpus, an Automatically Aligned Swiss German Speech to Standard German Text Corpus},
  author={Plüss, Michel and Neukom, Lukas and Vogel, Manfred},
  journal={arXiv preprint arXiv:2010.02810},
  year={2020}
}

@article{pluss2023stt4sg,
  title={STT4SG-350: A Speech Corpus for Swiss German with Standard German Translations},
  author={Plüss, Michel and Neukom, Lukas and Scheller, Christian and Vogel, Manfred},
  journal={arXiv preprint arXiv:2305.13179},
  year={2023}
}

Acknowledgments

  • OpenAI for the original Whisper model
  • Hugging Face for the Transformers library and model hosting
  • Swiss German speech dataset contributors for providing high-quality training data:
    • SwissDial-Zh v1.1: Pelin Dogan-Schönberger, Julian Mäder, Thomas Hofmann (ETH Zurich)
    • Swiss Parliament Corpus V2 (SPC): FHNW University of Applied Sciences and Arts Northwestern Switzerland
    • SDS-200 Corpus: Research community for comprehensive Swiss German dialect coverage
    • ArchiMob Corpus: Tanja Samardžić, Yves Scherrer, Elvira Glaser (University of Zurich)

License

This model is released under the Apache 2.0 license. The original Whisper model is also under Apache 2.0.

Technical Specifications

  • Architecture: Transformer encoder-decoder
  • Input: 16kHz mono audio
  • Output: Standard German text
  • Context Length: 30 seconds
  • Sampling Rate: 16,000 Hz
  • Feature Extraction: 128 Mel-frequency bins
  • Vocabulary Size: 51,866 tokens
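
These values can be checked directly against the configuration shipped with the checkpoint:

from transformers import AutoConfig

# Inspect the architecture parameters from the model config
config = AutoConfig.from_pretrained("Flurin17/whisper-large-v3-turbo-swiss-german")
print(config.num_mel_bins)          # Mel-frequency bins
print(config.vocab_size)            # vocabulary size
print(config.max_source_positions)  # encoder positions (30 s of audio)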
