---
language: en
license: mit
tags:
- audio
- automatic-speech-recognition
- whisper
- atc
- aviation
datasets:
- jlvdoorn/atco2-asr-atcosim
metrics:
- wer
model-index:
- name: whisper-large-v3-turbo-atcosim-finetune
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      type: jlvdoorn/atco2-asr-atcosim
      name: ATCOSIM
    metrics:
    - type: wer
      value: 3.73
      name: Word Error Rate
library_name: transformers
pipeline_tag: automatic-speech-recognition
inference:
  parameters:
    chunk_length_s: 30
    batch_size: 16
    return_timestamps: false
widget:
- example_title: ATC Sample 1
  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-1.wav
- example_title: ATC Sample 2
  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-2.wav
- example_title: ATC Sample 3
  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-3.wav
---

[DOI: 10.57967/hf/5272](https://doi.org/10.57967/hf/5272)

# Whisper Large V3 Turbo: Fine-tuned for ATC Domain

## Model Description

This model is a fine-tuned version of OpenAI's [Whisper Large V3 Turbo](https://huggingface.co/openai/whisper-large-v3-turbo), optimized for transcribing Air Traffic Control (ATC) communications.

The model was fine-tuned on the [atco2-asr-atcosim dataset](https://huggingface.co/datasets/jlvdoorn/atco2-asr-atcosim), which combines recordings from the ATCO2 and ATCOSIM corpora of English ATC speech.

## Intended Use

This model is designed for:
- Transcribing ATC radio communications
- Supporting aviation safety research
- Analyzing ATC communications for congestion patterns
- Enabling data-driven decision making in airspace management

## Training Methodology

The model was fine-tuned using a partial freezing approach to balance efficiency and adaptability:
- The first 24 of the 32 encoder layers were frozen
- All convolution layers and positional embeddings were frozen
- The remaining encoder layers and the decoder were fine-tuned

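
The freezing scheme above can be sketched as follows. This is a minimal illustration, not the training script used for this model: the function name `freeze_lower_encoder` is hypothetical, and it assumes that "convolution layers and positional embeddings" refers to the encoder's `conv1`, `conv2` and `embed_positions` modules as exposed by `WhisperForConditionalGeneration` in `transformers`. Decoder positional embeddings are left trainable here, which is also an assumption.

```python
def _freeze(module):
    """Disable gradients for every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = False


def freeze_lower_encoder(model, num_frozen_layers=24):
    """Freeze the encoder front end and its first `num_frozen_layers` layers.

    Assumes `model` exposes `model.model.encoder` with `conv1`, `conv2`,
    `embed_positions` and `layers`, as WhisperForConditionalGeneration does.
    Returns (trainable_parameter_count, total_parameter_count).
    """
    encoder = model.model.encoder

    # Convolutional front end and encoder positional embeddings
    for module in (encoder.conv1, encoder.conv2, encoder.embed_positions):
        _freeze(module)

    # Lower encoder layers; later layers and the decoder stay trainable
    for layer in encoder.layers[:num_frozen_layers]:
        _freeze(layer)

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total
```

With `transformers` installed, the function could be applied to a model loaded via `WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")` before it is handed to a trainer. The returned counts give a quick check of how much of the network remains trainable.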
Training hyperparameters:
- Learning rate: 1e-5
- Training steps: 5000
- Warmup steps: 500
- Gradient checkpointing enabled
- FP16 precision
- Batch size: 16 per device
- Evaluation metric: Word Error Rate (WER)

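
As a rough sketch, the hyperparameters listed above can be written as a plain dictionary whose keys follow the argument names of `Seq2SeqTrainingArguments` in `transformers`. The derived quantities use the 10 epochs reported under Training Metrics and assume a single device with no gradient accumulation, neither of which is stated in this card.

```python
# Hyperparameters as listed in this model card
training_kwargs = {
    "learning_rate": 1e-5,
    "max_steps": 5000,
    "warmup_steps": 500,
    "gradient_checkpointing": True,
    "fp16": True,
    "per_device_train_batch_size": 16,
}

# Reported under Training Metrics (5000 steps over 10 epochs)
EPOCHS = 10

# Share of training spent in learning-rate warmup
warmup_fraction = training_kwargs["warmup_steps"] / training_kwargs["max_steps"]

# Optimizer steps per epoch implied by the reported totals
steps_per_epoch = training_kwargs["max_steps"] // EPOCHS

# Assumption: one device, no gradient accumulation
examples_per_epoch = steps_per_epoch * training_kwargs["per_device_train_batch_size"]

print(f"warmup fraction: {warmup_fraction:.0%}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"examples per epoch (single device): {examples_per_epoch}")
```

One way to use the dictionary would be `Seq2SeqTrainingArguments(output_dir="...", **training_kwargs)`, where `output_dir` is a placeholder.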
## Performance

The model achieves improved transcription accuracy on aviation communications compared to the base Whisper model, with particular improvements in:
- ATC terminology recognition
- Callsign transcription accuracy
- Handling of radio transmission noise
- Recognition of standardized phraseology

### Training Metrics

Training progress over 5000 steps (10 epochs):

| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|---------|
| 1000 | 0.090100      | 0.081074        | 5.81697 |
| 2000 | 0.021100      | 0.080030        | 4.00939 |
| 3000 | 0.010000      | 0.080892        | 5.67438 |
| 4000 | 0.002500      | 0.080460        | 3.88357 |
| 5000 | 0.001400      | 0.080753        | 3.73678 |

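
The table can be summarized with a few lines of arithmetic. The sketch below copies the logged values and computes the best checkpoint by WER, the relative WER reduction between the first and the best evaluation, and how little the validation loss moves. Variable names are illustrative.

```python
# (step, training loss, validation loss, WER in %) copied from the table
eval_log = [
    (1000, 0.0901, 0.081074, 5.81697),
    (2000, 0.0211, 0.080030, 4.00939),
    (3000, 0.0100, 0.080892, 5.67438),
    (4000, 0.0025, 0.080460, 3.88357),
    (5000, 0.0014, 0.080753, 3.73678),
]

# Checkpoint with the lowest WER
best_step, _, _, best_wer = min(eval_log, key=lambda row: row[3])

# Relative WER reduction from the first evaluation to the best one
first_wer = eval_log[0][3]
relative_reduction = 100.0 * (first_wer - best_wer) / first_wer

# Range covered by the validation loss across all evaluations
val_losses = [row[2] for row in eval_log]
val_loss_spread = max(val_losses) - min(val_losses)

print(f"best step: {best_step} (WER {best_wer:.2f}%)")
print(f"relative WER reduction since step 1000: {relative_reduction:.1f}%")
print(f"validation loss spread: {val_loss_spread:.6f}")
```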
The final checkpoint (step 5000) reaches a Word Error Rate (WER) of 3.74%, down from 5.82% at step 1000. The improvement is not monotonic: WER rises to 5.67% at step 3000 before falling again. Validation loss stays near 0.08 throughout, while training loss drops from 0.0901 to 0.0014.

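
For reference, WER is the word-level edit distance between reference and hypothesis (substitutions, deletions and insertions) divided by the number of reference words. The function below is a minimal pure-Python sketch of that definition. This card does not say which implementation produced the reported numbers, and the example transmission is made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Made-up example: one substituted word out of ten
reference = "lufthansa four five six descend flight level one two zero"
hypothesis = "lufthansa four five six descend flight level one too zero"
print(word_error_rate(reference, hypothesis))
```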
## Limitations

- The model is specifically optimized for English ATC communications
- Performance may vary across different accents and regional phraseologies
- Not optimized for general speech recognition outside the aviation domain
- May struggle with extremely noisy transmissions or overlapping communications

## Usage

### Basic Usage with Pipeline

```python
import torch
from transformers import pipeline

# Configure device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model with pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="tclin/whisper-large-v3-turbo-atcosim-finetune",
    chunk_length_s=30,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe audio file
result = transcriber("path_to_atc_audio.wav")
print(f"Transcription: {result['text']}")
```

### Advanced Usage with Audio Processing

```python
import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

model_id = "tclin/whisper-large-v3-turbo-atcosim-finetune"

# Load and preprocess audio
audio_path = "path_to_atc_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16 kHz (required for Whisper models)
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Convert stereo to mono if needed
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Convert to a 1-D numpy array (drop the channel dimension)
waveform_np = waveform.squeeze(0).cpu().numpy()

# Configure device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model = model.to(device=device, dtype=torch_dtype)  # Explicit device and dtype setting
processor = WhisperProcessor.from_pretrained(model_id)

# Method 1: Using processor directly (recommended for precise control)
# The processor pads or truncates the audio to a single 30-second window.
input_features = processor(waveform_np, sampling_rate=16000, return_tensors="pt").input_features
# Input features must match the model's device and dtype
input_features = input_features.to(device=device, dtype=torch_dtype)

generated_ids = model.generate(input_features, max_new_tokens=128)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

# Method 2: Using pipeline with preprocessed audio
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    torch_dtype=torch_dtype,
    device=device,
)

# A raw numpy array is assumed to already be sampled at 16 kHz
result = pipe(waveform_np)
print(f"Transcription: {result['text']}")
```

### Important Notes

- Always resample audio to 16 kHz before processing; Whisper's feature extractor expects 16 kHz input
- When running on GPU, set device and dtype together with `model.to(device=device, dtype=torch_dtype)` and cast the input features to the same dtype
- For audio longer than 30 seconds, use the pipeline's `chunk_length_s` parameter so the file is transcribed in chunks
- The model performs best on clean ATC communications with standard phraseology

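
To give a feel for what `chunk_length_s` does, the sketch below lists the overlapping time windows a long recording would be cut into. It is a simplified illustration, not the pipeline's actual code: the helper name `chunk_windows` is hypothetical, and it assumes the stride on each side defaults to `chunk_length_s / 6`, as documented for the `transformers` speech recognition pipeline. The real pipeline works on audio samples and also merges the overlapping transcriptions, which is not shown here.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_length_s=None):
    """Return (start, end) times in seconds of overlapping chunks.

    Simplified illustration of chunked inference: consecutive windows
    overlap by twice the stride so that words cut at a chunk boundary
    appear whole in a neighbouring chunk.
    """
    if stride_length_s is None:
        stride_length_s = chunk_length_s / 6  # assumed default stride
    step = chunk_length_s - 2 * stride_length_s

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if start + chunk_length_s >= total_s:
            break  # this window already reaches the end of the audio
        start += step
    return windows


# A 70-second recording with the default 30-second chunks
for start, end in chunk_windows(70.0):
    print(f"{start:5.1f} s -> {end:5.1f} s")
```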
## Broader Application

This model serves as a component in a larger speech-to-analysis pipeline for ATC communications that includes:
1. Audio-to-text transcription (this model)
2. Domain-specific text reformatting using contextual knowledge
3. Congestion analysis based on transcribed communications

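
As a purely illustrative example of what a reformatting step (stage 2) might involve, the sketch below collapses runs of spoken digits in a transcript into numerals. It is not the reformatting method used in the pipeline described above, which this card does not specify. The function name, the word list and the example transmission are all assumptions made for illustration.

```python
# Spoken digits as they commonly appear in ATC transcripts
# ("niner" is the standard radiotelephony pronunciation of 9)
SPOKEN_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "niner": "9",
}


def collapse_spoken_digits(text: str) -> str:
    """Replace each run of spoken digits with a single numeral string."""
    out = []
    run = []  # digits collected from the current run of digit words
    for word in text.lower().split():
        if word in SPOKEN_DIGITS:
            run.append(SPOKEN_DIGITS[word])
            continue
        if run:
            out.append("".join(run))
            run = []
        out.append(word)
    if run:
        out.append("".join(run))
    return " ".join(out)


# Made-up transmission
print(collapse_spoken_digits(
    "lufthansa four five six descend flight level one two zero"
))
```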
## Citation

If you use this model in your research, please cite:

```
@misc{ta-chun_lin_2025,
  author    = { Ta-Chun Lin },
  title     = { whisper-large-v3-turbo-atcosim-finetune (Revision 4b2d400) },
  year      = 2025,
  url       = { https://huggingface.co/tclin/whisper-large-v3-turbo-atcosim-finetune },
  doi       = { 10.57967/hf/5272 },
  publisher = { Hugging Face }
}
```

## Acknowledgments

- OpenAI for the base Whisper model
- The ATCOSIM dataset for providing high-quality ATC communications data
- The open-source community for tools and frameworks that made this fine-tuning possible