	---
	language: en
	license: mit
	tags:
	- audio
	- automatic-speech-recognition
	- whisper
	- atc
	- aviation
	datasets:
	- jlvdoorn/atco2-asr-atcosim
	metrics:
	- wer
	model-index:
	- name: whisper-large-v3-turbo-atcosim-finetune
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Speech Recognition
	    dataset:
	      type: jlvdoorn/atco2-asr-atcosim
	      name: ATCOSIM
	    metrics:
	    - type: wer
	      value: 3.73
	      name: Word Error Rate
	library_name: transformers
	pipeline_tag: automatic-speech-recognition
	inference:
	  parameters:
	    chunk_length_s: 30
	    batch_size: 16
	    return_timestamps: false
	widget:
	- example_title: ATC Sample 1
	  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-1.wav
	- example_title: ATC Sample 2
	  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-2.wav
	- example_title: ATC Sample 3
	  src: https://huggingface.co/spaces/tclin/atc-whisper-transcriber/resolve/main/atc-sample-3.wav
	---
	[![DOI](https://img.shields.io/badge/DOI-10.57967%2Fhf%2F5272-blue)](https://doi.org/10.57967/hf/5272)
	# Whisper Large V3 Turbo: Fine-tuned for ATC Domain

	## Model Description

	This model is a fine-tuned version of OpenAI's [Whisper Large V3 Turbo](https://huggingface.co/openai/whisper-large-v3-turbo) specifically optimized for Air Traffic Control (ATC) communications transcription.

	The model was fine-tuned on the [ATCOSIM dataset](https://huggingface.co/datasets/jlvdoorn/atco2-asr-atcosim), which contains real ATC communications from operational environments.

	## Intended Use

	This model is designed for:
	- Transcribing ATC radio communications
	- Supporting aviation safety research
	- Analyzing ATC communications for congestion patterns
	- Enabling data-driven decision making in airspace management

	## Training Methodology

	The model was fine-tuned using a partial freezing approach to balance efficiency and adaptability:
	- The first 24 of the 32 encoder layers were frozen
	- All convolution layers and positional embeddings were frozen
	- The remaining 8 encoder layers and the decoder were fine-tuned
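A minimal sketch of how the freezing scheme listed above could be expressed, assuming the parameter names used by the Hugging Face `WhisperForConditionalGeneration` implementation (`model.encoder.conv1`, `model.encoder.layers.N`, `embed_positions`). The helper names `is_frozen` and `apply_partial_freeze` are hypothetical, and freezing the decoder's `embed_positions` as well is only one reading of "all positional embeddings"; the actual training script may differ.

```python
import re

# From the list above: the first 24 encoder layers are frozen.
N_FROZEN_ENCODER_LAYERS = 24

# Assumed Hugging Face Whisper parameter naming, e.g.
# "model.encoder.layers.7.self_attn.k_proj.weight" or "model.encoder.conv1.bias".
_ENCODER_LAYER = re.compile(r"(?:^|\.)encoder\.layers\.(\d+)\.")
_ENCODER_CONV = re.compile(r"(?:^|\.)encoder\.conv\d+\.")


def is_frozen(param_name, n_frozen=N_FROZEN_ENCODER_LAYERS):
    """Return True if the parameter with this name should stay fixed."""
    if _ENCODER_CONV.search(param_name):
        return True  # convolutional front end
    if "embed_positions" in param_name:
        return True  # positional embeddings (assumption: encoder and decoder)
    match = _ENCODER_LAYER.search(param_name)
    # Encoder layers are indexed from 0, so indices 0..n_frozen-1 are frozen.
    return match is not None and int(match.group(1)) < n_frozen


def apply_partial_freeze(model, n_frozen=N_FROZEN_ENCODER_LAYERS):
    """Set requires_grad on every parameter; return (trainable, total) counts."""
    trainable = total = 0
    for name, param in model.named_parameters():
        param.requires_grad = not is_frozen(name, n_frozen)
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable, total
```

With a loaded model, `apply_partial_freeze(model)` would be called once before the optimizer or trainer is created, and the returned counts give a quick check of how much of the network is still trainable.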

	Training hyperparameters:
	- Learning rate: 1e-5
	- Training steps: 5000
	- Warmup steps: 500
	- Gradient checkpointing enabled
	- FP16 precision
	- Batch size: 16 per device
	- Evaluation metric: Word Error Rate (WER)
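As a rough illustration, the hyperparameters above can be written as keyword arguments for `transformers.Seq2SeqTrainingArguments`. The dictionary below is a sketch, not the original training configuration: mapping "evaluation metric" to `metric_for_best_model` is an assumption, and the `samples_seen` helper is hypothetical and assumes a single device with no gradient accumulation unless told otherwise.

```python
# Keyword names follow transformers.Seq2SeqTrainingArguments;
# the values are the ones listed above.
TRAINING_KWARGS = {
    "learning_rate": 1e-5,
    "max_steps": 5000,
    "warmup_steps": 500,
    "gradient_checkpointing": True,
    "fp16": True,
    "per_device_train_batch_size": 16,
    # Assumption: WER is also used to select the best checkpoint.
    "metric_for_best_model": "wer",
    "greater_is_better": False,  # lower WER is better
}


def samples_seen(max_steps, per_device_batch_size, n_devices=1, grad_accum_steps=1):
    """Number of training samples processed over the whole run."""
    return max_steps * per_device_batch_size * n_devices * grad_accum_steps


total = samples_seen(
    TRAINING_KWARGS["max_steps"],
    TRAINING_KWARGS["per_device_train_batch_size"],
)
print(total)  # 80000

# Hypothetical usage (output_dir is a placeholder):
# from transformers import Seq2SeqTrainingArguments
# args = Seq2SeqTrainingArguments(output_dir="./whisper-atc", **TRAINING_KWARGS)
```

Under the single-device assumption, 80,000 samples over the 10 epochs reported in the training metrics would correspond to roughly 8,000 training utterances per epoch.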

	## Performance

	The model achieves improved transcription accuracy on aviation communications compared to the base Whisper model, with particular improvements in:
	- ATC terminology recognition
	- Callsign transcription accuracy
	- Handling of radio transmission noise
	- Recognition of standardized phraseology

	### Training Metrics

	Training progress over 5000 steps (10 epochs):

	| Step | Training Loss | Validation Loss | WER (%) |
	|------|---------------|-----------------|---------|
	| 1000 | 0.090100 | 0.081074 | 5.81697 |
	| 2000 | 0.021100 | 0.080030 | 4.00939 |
	| 3000 | 0.010000 | 0.080892 | 5.67438 |
	| 4000 | 0.002500 | 0.080460 | 3.88357 |
	| 5000 | 0.001400 | 0.080753 | 3.73678 |

	The final checkpoint reaches a Word Error Rate (WER) of about 3.74% at step 5000, down from 5.82% at step 1000. The improvement is not monotonic: WER rises to 5.67% at step 3000 before falling again, and validation loss stays near 0.08 while training loss drops to 0.0014.
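For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration for a single utterance; the function name and the example sentence are made up, and the card does not state which WER implementation or text normalization produced the figures in the table.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent for one reference/hypothesis pair (whitespace tokenization)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Illustrative example: one of ten reference words is dropped.
reference = "lufthansa four five six descend flight level two one zero"
hypothesis = "lufthansa four five six descend flight level two zero"
print(word_error_rate(reference, hypothesis))  # 10.0
```

A corpus-level WER is normally obtained by summing the edit counts and reference lengths over all utterances rather than averaging per-utterance percentages.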

	## Limitations

	- The model is specifically optimized for English ATC communications
	- Performance may vary across different accents and regional phraseologies
	- Not optimized for general speech recognition outside the aviation domain
	- May struggle with extremely noisy transmissions or overlapping communications

	## Usage

	### Basic Usage with Pipeline

	```python
	import torch
	from transformers import pipeline

	# Configure device and precision
	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

	# Load the model with pipeline
	transcriber = pipeline(
	    "automatic-speech-recognition",
	    model="tclin/whisper-large-v3-turbo-atcosim-finetune",
	    chunk_length_s=30,
	    max_new_tokens=128,
	    torch_dtype=torch_dtype,
	    device=device,
	)

	# Transcribe audio file
	result = transcriber("path_to_atc_audio.wav")
	print(f"Transcription: {result['text']}")
	```

	### Advanced Usage with Audio Processing

	```python
	import torch
	import torchaudio
	from transformers import WhisperProcessor, WhisperForConditionalGeneration

	# Load and preprocess audio
	audio_path = "path_to_atc_audio.wav"
	waveform, sample_rate = torchaudio.load(audio_path)

	# Resample to 16kHz (required for Whisper models)
	if sample_rate != 16000:
	    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
	    waveform = resampler(waveform)

	# Convert stereo to mono if needed
	if waveform.shape[0] > 1:
	    waveform = waveform.mean(dim=0, keepdim=True)

	# Convert to numpy array
	waveform_np = waveform.squeeze().cpu().numpy()

	# Configure device and precision
	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

	# Load model and processor
	model = WhisperForConditionalGeneration.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")
	model = model.to(device=device, dtype=torch_dtype) # Explicit device and dtype setting
	processor = WhisperProcessor.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")

	# Method 1: Using processor directly (recommended for precise control)
	input_features = processor(waveform_np, sampling_rate=16000, return_tensors="pt").input_features
	input_features = input_features.to(device=device, dtype=torch_dtype)

	generated_ids = model.generate(input_features, max_new_tokens=128)
	transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
	print(f"Transcription: {transcription}")

	# Method 2: Using pipeline with preprocessed audio
	from transformers import pipeline

	pipe = pipeline(
	    "automatic-speech-recognition",
	    model=model,
	    tokenizer=processor.tokenizer,
	    feature_extractor=processor.feature_extractor,
	    max_new_tokens=128,
	    chunk_length_s=30,
	    torch_dtype=torch_dtype,
	    device=device,
	)

	result = pipe(waveform_np)
	print(f"Transcription: {result['text']}")
	```

	### Important Notes

	- Always ensure audio is resampled to 16kHz before processing
	- Explicitly set both device and dtype when using GPU with `model.to(device=device, dtype=torch_dtype)`
	- For audio longer than Whisper's 30-second input window, set `chunk_length_s` in the pipeline so the file is transcribed in chunks
	- The model performs best on clean ATC communications with standard phraseology
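The first and third notes can be checked before any model is loaded. The sketch below uses only the standard-library `wave` module, so it applies to uncompressed PCM WAV files only; the function name `inspect_wav` and the returned keys are hypothetical and not part of this model's tooling.

```python
import wave

TARGET_SAMPLE_RATE = 16000  # Whisper's feature extractor expects 16 kHz audio
WHISPER_WINDOW_S = 30.0     # Whisper processes at most 30 seconds per forward pass


def inspect_wav(path):
    """Read a PCM WAV header and report which preprocessing steps are needed."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration_s = wav_file.getnframes() / sample_rate
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_s": duration_s,
        "needs_resampling": sample_rate != TARGET_SAMPLE_RATE,
        "needs_downmix": channels > 1,
        "needs_chunking": duration_s > WHISPER_WINDOW_S,
    }
```

Calling `inspect_wav("path_to_atc_audio.wav")` before transcription shows whether the resampling, mono conversion, or chunking steps from the usage examples apply to a given file.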

	## Broader Application

	This model serves as a component in a larger speech-to-analysis pipeline for ATC communications that includes:
	1. Audio-to-text transcription (this model)
	2. Domain-specific text reformatting using contextual knowledge
	3. Congestion analysis based on transcribed communications

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@misc{ta-chun_lin_2025,
	    author    = { Ta-Chun Lin },
	    title     = { whisper-large-v3-turbo-atcosim-finetune (Revision 4b2d400) },
	    year      = 2025,
	    url       = { https://huggingface.co/tclin/whisper-large-v3-turbo-atcosim-finetune },
	    doi       = { 10.57967/hf/5272 },
	    publisher = { Hugging Face }
	}
	```

	## Acknowledgments

	- OpenAI for the base Whisper model
	- The creators of the ATCOSIM dataset for providing high-quality ATC communications data
	- The open-source community for tools and frameworks that made this fine-tuning possible