---
license: apache-2.0
datasets:
- 012shin/fake-audio-detection-augmented
language:
- en
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- MIT/ast-finetuned-audioset-10-10-0.4593
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio
- audio-classification
- fake-audio-detection
- ast
model-index:
- name: ast-fakeaudio-detector
  results:
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: fake-audio-detection-augmented
      type: 012shin/fake-audio-detection-augmented
    metrics:
    - type: accuracy
      value: 0.9662
    - type: f1
      value: 0.9710
    - type: precision
      value: 0.9692
    - type: recall
      value: 0.9728
---

# AST Fine-tuned for Fake Audio Detection

This model is a fine-tuned version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) for detecting fake/synthetic audio. The original AST (Audio Spectrogram Transformer) classification head was replaced with a binary classification layer optimized for fake audio detection.

## Model Description

- **Base Model**: MIT/ast-finetuned-audioset-10-10-0.4593 (AST pretrained on AudioSet)
- **Task**: Binary classification (fake/real audio detection)
- **Input**: Audio converted to a Mel spectrogram (128 mel bins, 1024 time frames)
- **Output**: Binary prediction (0: fake audio, 1: real audio)
- **Training Hardware**: 2x NVIDIA T4 GPUs

## Training Configuration

```python
{
    'learning_rate': 1e-5,
    'weight_decay': 0.01,
    'n_iterations': 1500,
    'batch_size': 16,
    'gradient_accumulation_steps': 8,
    'validate_every': 500,
    'val_samples': 5000
}
```

## Dataset Distribution

The model was trained on a filtered dataset with the following class distribution:

```
Training Set:
- Fake Audio (0): 29,089 samples (53.97%)
- Real Audio (1): 24,813 samples (46.03%)

Test Set:
- Fake Audio (0): 7,229 samples (53.64%)
- Real Audio (1): 6,247 samples (46.36%)
```

## Model Performance

Final metrics on the validation set:

- Accuracy: 0.9662 (96.62%)
- F1 Score: 0.9710 (97.10%)
- Precision: 0.9692 (96.92%)
- Recall: 0.9728 (97.28%)

# Usage Guide

## Model Usage

```python
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
import torchaudio
import torch

# Load audio file
waveform, sample_rate = torchaudio.load("path_to_audio.ogg")

# Downmix multi-channel audio to mono
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to the 16 kHz rate the model expects
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)

# Initialize model and feature extractor
model_name = "WpythonW/ast-fakeaudio-detector"
extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name)

# Process audio and get predictions
inputs = extractor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Class 0 is fake audio, class 1 is real audio
print(f"Probability of fake audio: {probabilities[0][0]:.2%}")
```

## Limitations

Important considerations when using this model:

1. The model expects 16 kHz audio; resample other sample rates before inference, as in the usage example above
2. Performance may vary with types of audio manipulation that were not present in the training data
3. Very short audio clips (<1 second) may not produce reliable results
4. The model should not be used as the sole determiner for real/fake audio detection

## Training Details

The training process involved the following steps (sketched below):

1. Loading the base AST model pretrained on AudioSet
2. Replacing the classification head with a binary classifier
3. Fine-tuning on the fake audio detection dataset for 1500 iterations
4. Using gradient accumulation (8 steps) with batch size 16
5. Running validation checks every 500 iterations
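
For steps 1-2, a minimal sketch of how the binary head can be created with `transformers`. The exact training code for this model is not published, so treat this as an assumption rather than the authors' script:

```python
from transformers import AutoModelForAudioClassification

# Reload the AudioSet checkpoint with a fresh 2-class head.
# ignore_mismatched_sizes drops the original 527-class AudioSet
# classifier weights and randomly initializes a new binary layer.
model = AutoModelForAudioClassification.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593",
    num_labels=2,
    ignore_mismatched_sizes=True,
)

# Label mapping taken from the dataset distribution above
model.config.id2label = {0: "fake", 1: "real"}
model.config.label2id = {"fake": 0, "real": 1}
```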
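
For steps 3-5, a sketch of the optimization loop implied by the training configuration above, continuing from the `model` in the previous snippet. Here `train_loader` (a DataLoader yielding 16-sample batches of spectrogram inputs and labels) and `run_validation` (the 5,000-sample validation pass) are hypothetical placeholders:

```python
from torch.optim import AdamW

# Hyperparameters from the training configuration above
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
accum_steps = 8          # gradient_accumulation_steps
n_iterations = 1500      # optimizer steps, not micro-batches
validate_every = 500

model.train()
iteration = 0
for step, batch in enumerate(train_loader):  # hypothetical DataLoader
    outputs = model(input_values=batch["input_values"], labels=batch["labels"])
    # Scale the loss so gradients average over the accumulation window
    (outputs.loss / accum_steps).backward()

    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        iteration += 1
        if iteration % validate_every == 0:
            run_validation(model)  # hypothetical: evaluate on 5,000 held-out samples
        if iteration >= n_iterations:
            break
```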