Whisper Large V3 Turbo - Swiss German Fine-tuned

This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, adapted for Swiss German (Schweizerdeutsch) automatic speech recognition. The model transcribes Swiss German speech to Standard German text. Note: formal evaluation of this model is still pending.

Model Description

  • Base Model: openai/whisper-large-v3-turbo
  • Language(s): Swiss German dialects → Standard German text
  • Model Size: 809M parameters
  • License: Apache 2.0
  • Finetuned from: openai/whisper-large-v3-turbo

Performance

Formal evaluation is still in progress; the placeholders below will be filled in once benchmark results are available (a sketch of how to compute the metrics follows this list):

  • Word Error Rate (WER): pending evaluation
  • Character Error Rate (CER): pending evaluation
  • Training Data: 350+ hours of Swiss German speech
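
WER and CER can be computed with the 🤗 evaluate library. A minimal sketch, assuming predictions from a held-out Swiss German test set (the example strings are placeholders, not real benchmark data):

import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# predictions: model outputs on the test set;
# references: the corresponding Standard German transcripts
predictions = ["ich gehe morgen nach zürich"]
references = ["ich gehe morgen nach zürich"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))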

Training Data

This model was fine-tuned on a comprehensive dataset of Swiss German speech, including:

  • SwissDial-Zh v1.1: 24 hours of balanced Swiss German dialects
  • Swiss Parliament Corpus V2 (SPC): 293 hours of parliamentary speech data
  • All Swiss German Dialects Test Set: 13 hours with representative dialect distribution
  • ArchiMob Release 2: 70 hours

Total training data: 350+ hours of high-quality Swiss German speech with Standard German transcriptions.

Supported Dialects

The model supports all major Swiss German dialects:

  • Aargau (AG)
  • Bern (BE)
  • Basel (BS)
  • Graubünden (GR)
  • Lucerne (LU)
  • St. Gallen (SG)
  • Valais (VS)
  • Zurich (ZH)

Usage

Quick Start with Pipeline

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Flurin17/whisper-large-v3-turbo-swiss-german"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True, 
    use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a Swiss German audio file
result = pipe("path/to/swiss_german_audio.wav")
print(result["text"])
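
Whisper operates on 30-second windows. For recordings longer than that, enable chunked long-form inference with the pipeline's chunk_length_s argument (the file name below is a placeholder):

# Chunked long-form inference for audio longer than 30 seconds
result = pipe("long_swiss_german_audio.wav", chunk_length_s=30, batch_size=8)
print(result["text"])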

Batch Processing

# Process multiple files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(audio_files, batch_size=8)

for result in results:
    print(result["text"])

With Timestamps

# Get word-level timestamps
result = pipe("swiss_german_audio.wav", return_timestamps="word")
print(result["chunks"])

# Get segment-level timestamps
result = pipe("swiss_german_audio.wav", return_timestamps=True)
print(result["chunks"])

Advanced Usage with Model + Processor

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import librosa

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Flurin17/whisper-large-v3-turbo-swiss-german"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Load and preprocess audio
audio_array, sampling_rate = librosa.load("swiss_german_audio.wav", sr=16000)

inputs = processor(
    audio_array,
    sampling_rate=sampling_rate,
    return_tensors="pt"
)
inputs = inputs.to(device, dtype=torch_dtype)

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(**inputs)

# Decode the transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
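
Whisper's generate also accepts language and task arguments. If decoding ever drifts to the wrong language, you can pin it explicitly; this is optional, as the fine-tuned checkpoint targets Standard German output by design:

# Optionally force German transcription during generation
with torch.no_grad():
    predicted_ids = model.generate(**inputs, language="german", task="transcribe")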

Training Details

Training Hyperparameters

  • Learning Rate: 2e-5
  • Batch Size: 24 per device (train), 4 per device (eval)
  • Gradient Accumulation Steps: 2
  • Epochs: 3
  • Weight Decay: 0.005
  • Warmup Ratio: 0.03
  • Precision: bfloat16
  • Optimizer: AdamW
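
For reference, a minimal sketch of how these hyperparameters map onto 🤗 Seq2SeqTrainingArguments; the output directory and the evaluation-related settings are illustrative assumptions, not the exact original configuration:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-swiss-german",  # assumed path
    learning_rate=2e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    weight_decay=0.005,
    warmup_ratio=0.03,
    bf16=True,                    # bfloat16 precision
    optim="adamw_torch",          # AdamW optimizer
    predict_with_generate=True,   # decode with generate() during evaluation
)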

Training Infrastructure

  • Hardware: 4x NVIDIA A100 GPUs (80GB each)
  • Compute: Azure Machine Learning
  • Training Time: ~5 hours
  • Framework: 🤗 Transformers, PyTorch

Data Processing

The training data was processed with the following pipeline:

  • Audio resampling to 16kHz
  • Log-Mel spectrogram feature extraction (128 Mel bins)
  • Text normalization and tokenization
  • Dynamic batching with sequence length grouping
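
A minimal sketch of these preprocessing steps for a single example; the field names ("audio_path", "text") and the use of librosa for resampling are assumptions:

import librosa
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")

def prepare_example(example):
    # Resample the audio to 16 kHz
    audio, sr = librosa.load(example["audio_path"], sr=16000)
    # Extract log-Mel spectrogram features (128 Mel bins for large-v3 models)
    example["input_features"] = processor(
        audio, sampling_rate=sr, return_tensors="np"
    ).input_features[0]
    # Tokenize the Standard German transcription into label ids
    example["labels"] = processor.tokenizer(example["text"]).input_ids
    return example

Dynamic batching with sequence length grouping roughly corresponds to the group_by_length option of the 🤗 Trainer.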

Comparison with Other Models

| Model | WER | CER | Parameters |
|---|---|---|---|
| whisper-large-v3-turbo-swiss-german | pending | pending | 809M |
| whisper-large-v3-turbo (zero-shot) | pending | pending | 809M |

Limitations and Bias

  • Domain: Primarily trained on read speech and parliamentary proceedings
  • Dialects: Performance may vary across different Swiss German dialects
  • Audio Quality: Best performance on clean, high-quality audio recordings
  • Speaker Demographics: Training data may not be fully representative of all speaker groups
  • Transcription Style: Outputs Standard German text, not dialectal transcriptions

Model Card Authors

  • Flurin17 - Model development and fine-tuning

Citation

If you use this model in your research, please cite:

@misc{whisper-large-v3-turbo-swiss-german-2025,
  author = {Flurin17},
  title = {Whisper Large V3 Turbo Fine-tuned for Swiss German},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Flurin17/whisper-large-v3-turbo-swiss-german}
}

Also consider citing the original Whisper paper:

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

And the Swiss German datasets used for training:

@article{dogan2021swissdial,
  title={SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German},
  author={Dogan-Schönberger, Pelin and Mäder, Julian and Hofmann, Thomas},
  journal={arXiv preprint arXiv:2103.11401},
  year={2021}
}

@inproceedings{samardzic2016archimob,
  title={ArchiMob - A Corpus of Spoken Swiss German},
  author={Samardžić, Tanja and Scherrer, Yves and Glaser, Elvira},
  booktitle={Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  pages={4061--4066},
  year={2016},
  url={https://aclanthology.org/L16-1641}
}

@article{scherrer2019digitising,
  title={Digitising Swiss German: how to process and study a polycentric spoken language},
  author={Scherrer, Yves and Samardžić, Tanja and Glaser, Elvira},
  journal={Language Resources and Evaluation},
  volume={53},
  pages={735--769},
  year={2019},
  doi={10.1007/s10579-019-09457-5}
}

@inproceedings{pluss2022sds200,
  title={SDS-200: A Swiss German speech to standard German text corpus},
  author={Plüss, Michel and Hürlimann, Manuela and Cuny, Marc and Stöckli, Alla and Kapotis, Nikolaos and Hartmann, Julia and Ulasik, Malgorzata Anna and Scheller, Christian and Schraner, Yanick and Jain, Amit and Deriu, Jan and Cieliebak, Mark and Vogel, Manfred},
  booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference},
  pages={3250--3256},
  year={2022},
  address={Marseille, France},
  publisher={European Language Resources Association}
}

@article{pluss2020spc,
  title={Swiss Parliaments Corpus, an Automatically Aligned Swiss German Speech to Standard German Text Corpus},
  author={Plüss, Michel and Neukom, Lukas and Vogel, Manfred},
  journal={arXiv preprint arXiv:2010.02810},
  year={2020}
}

@article{pluss2023stt4sg,
  title={STT4SG-350: A Speech Corpus for Swiss German with Standard German Translations},
  author={Plüss, Michel and Neukom, Lukas and Scheller, Christian and Vogel, Manfred},
  journal={arXiv preprint arXiv:2305.13179},
  year={2023}
}

Acknowledgments

  • OpenAI for the original Whisper model
  • Hugging Face for the Transformers library and model hosting
  • Swiss German speech dataset contributors for providing high-quality training data:
    • SwissDial-Zh v1.1: Pelin Dogan-Schönberger, Julian Mäder, Thomas Hofmann (ETH Zurich)
    • Swiss Parliament Corpus V2 (SPC): FHNW University of Applied Sciences and Arts Northwestern Switzerland
    • SDS-200 Corpus: Research community for comprehensive Swiss German dialect coverage
    • ArchiMob Corpus: Tanja Samardžić, Yves Scherrer, Elvira Glaser (University of Zurich)

License

This model is released under the Apache 2.0 license. The original Whisper model is also under Apache 2.0.

Technical Specifications

  • Architecture: Transformer encoder-decoder
  • Input: 16kHz mono audio
  • Output: Standard German text
  • Context Length: 30 seconds
  • Sampling Rate: 16,000 Hz
  • Feature Extraction: 128 Mel-frequency bins
  • Vocabulary Size: 51,866 tokens
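
These values can be checked directly against the configuration shipped with the checkpoint:

from transformers import AutoConfig

# Inspect the architecture parameters from the model config
config = AutoConfig.from_pretrained("Flurin17/whisper-large-v3-turbo-swiss-german")
print(config.num_mel_bins)          # Mel-frequency bins
print(config.vocab_size)            # vocabulary size
print(config.max_source_positions)  # encoder positions (30 s of audio)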
