---
license: apache-2.0
datasets:
- 012shin/fake-audio-detection-augmented
language:
- en
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- MIT/ast-finetuned-audioset-10-10-0.4593
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio
- audio-classification
- fake-audio-detection
- ast
widget:
- text: "Upload an audio file to check if it's real or synthetic"
inference:
  parameters:
    sampling_rate: 16000
    audio_channel: "mono"
model-index:
- name: ast-fakeaudio-detector
  results:
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: fake-audio-detection-augmented
      type: 012shin/fake-audio-detection-augmented
    metrics:
    - type: accuracy
      value: 0.9662
    - type: f1
      value: 0.9710
    - type: precision
      value: 0.9692
    - type: recall
      value: 0.9728
---

# AST Fine-tuned for Fake Audio Detection

This model is a fine-tuned version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) for detecting fake/synthetic audio. The original AST (Audio Spectrogram Transformer) classification head was replaced with a binary classification layer optimized for this task.

## Model Description

- **Base Model**: MIT/ast-finetuned-audioset-10-10-0.4593 (AST pretrained on AudioSet)
- **Task**: Binary classification (fake/real audio detection)
- **Input**: Audio converted to a Mel spectrogram (128 mel bins, 1024 time frames)
- **Output**: Binary prediction (0: fake audio, 1: real audio)
- **Training Hardware**: 2x NVIDIA T4 GPUs

## Training Configuration

```python
{
    'learning_rate': 1e-5,
    'weight_decay': 0.01,
    'n_iterations': 1500,
    'batch_size': 16,
    'gradient_accumulation_steps': 8,
    'validate_every': 500,
    'val_samples': 5000
}
```

## Dataset Distribution

The model was trained on a filtered dataset with the following class distribution:

```
Training Set:
- Fake Audio (0): 29,089 samples (53.97%)
- Real Audio (1): 24,813 samples (46.03%)

Test Set:
- Fake Audio (0): 7,229 samples (53.64%)
- Real Audio (1): 6,247 samples (46.36%)
```

## Model Performance

Final metrics on the validation set:

- Accuracy: 0.9662 (96.62%)
- F1 Score: 0.9710 (97.10%)
- Precision: 0.9692 (96.92%)
- Recall: 0.9728 (97.28%)

# Usage Guide

## Model Usage

```python
import torch
import torchaudio
import soundfile as sf
import numpy as np
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load model and move to available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "WpythonW/ast-fakeaudio-detector"
extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name).to(device)
model.eval()

# Process multiple audio files
audio_files = ["audio1.wav", "audio2.mp3", "audio3.ogg"]
processed_batch = []

for audio_path in audio_files:
    # Load audio file
    audio_data, sr = sf.read(audio_path)

    # Convert stereo to mono if needed
    if len(audio_data.shape) > 1 and audio_data.shape[1] > 1:
        audio_data = np.mean(audio_data, axis=1)

    # Resample to 16kHz if needed
    if sr != 16000:
        waveform = torch.from_numpy(audio_data).float()
        if len(waveform.shape) == 1:
            waveform = waveform.unsqueeze(0)
        resample = torchaudio.transforms.Resample(
            orig_freq=sr,
            new_freq=16000
        )
        waveform = resample(waveform)
        audio_data = waveform.squeeze().numpy()

    processed_batch.append(audio_data)

# Prepare batch input
inputs = extractor(
    processed_batch,
    sampling_rate=16000,
    padding=True,
    return_tensors="pt"
)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Get predictions
with torch.no_grad():
    logits = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Process results
for filename, probs in zip(audio_files, probabilities):
    fake_prob = float(probs[0].cpu())
    real_prob = float(probs[1].cpu())
    prediction = "FAKE" if fake_prob > real_prob else "REAL"

    print(f"\nFile: {filename}")
    print(f"Fake probability: {fake_prob:.2%}")
    print(f"Real probability: {real_prob:.2%}")
    print(f"Verdict: {prediction}")
```
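The snippet above hard-codes index 0 as fake and index 1 as real, matching the dataset's label mapping. A safer pattern is to read the mapping from the model config; a small sketch reusing `model`, `audio_files`, and `probabilities` from the block above (the exact label strings depend on how the checkpoint was saved, so print them rather than assuming names):

```python
# Resolve class names from the checkpoint instead of hard-coding indices.
id2label = model.config.id2label
for filename, probs in zip(audio_files, probabilities):
    scores = {id2label[i]: float(p) for i, p in enumerate(probs.cpu())}
    print(filename, scores)
```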
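For quick experiments, the same checkpoint can also be driven through the high-level `pipeline` API, which handles model loading, resampling, and feature extraction internally. A minimal sketch (decoding audio files this way requires ffmpeg, and the returned label strings again come from the checkpoint's `id2label` mapping):

```python
from transformers import pipeline

# The pipeline loads the model and feature extractor and resamples
# input audio to 16kHz internally.
detector = pipeline("audio-classification", model="WpythonW/ast-fakeaudio-detector")

for result in detector(["audio1.wav", "audio2.mp3"]):
    print(result)  # a list of {'label': ..., 'score': ...} dicts per file
```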
## Limitations

Important considerations when using this model:

1. The model works best with 16kHz audio input
2. Performance may vary with types of audio manipulation not present in the training data
3. Very short audio clips (<1 second) might not provide reliable results
4. The model should not be used as the sole determiner for real/fake audio detection

## Training Details

The training process involved the following steps, sketched in code below:

1. Loading the base AST model pretrained on AudioSet
2. Replacing the classification head with a binary classifier
3. Fine-tuning on the fake audio detection dataset for 1500 iterations
4. Using gradient accumulation (8 steps) with batch size 16
5. Running validation every 500 steps
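A minimal sketch of that procedure, using the hyperparameters from the training configuration above. This is an illustration, not the exact training script: it assumes the dataset has a `train` split with an `audio` column and an integer `label` column (0 = fake, 1 = real), and the validation hook is left as a stub.

```python
import torch
from datasets import Audio, load_dataset
from torch.utils.data import DataLoader
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
base = "MIT/ast-finetuned-audioset-10-10-0.4593"

# Steps 1-2: load the base AST model and swap its 527-class AudioSet
# head for a freshly initialized binary head.
extractor = AutoFeatureExtractor.from_pretrained(base)
model = AutoModelForAudioClassification.from_pretrained(
    base,
    num_labels=2,
    ignore_mismatched_sizes=True,  # discard the original classification head
).to(device)

# Assumed schema: an "audio" column and a "label" column (0 = fake,
# 1 = real); adjust the names to the actual dataset.
dataset = load_dataset("012shin/fake-audio-detection-augmented", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

def collate(examples):
    waveforms = [ex["audio"]["array"] for ex in examples]
    batch = extractor(waveforms, sampling_rate=16000, return_tensors="pt")
    batch["labels"] = torch.tensor([ex["label"] for ex in examples])
    return batch

loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

# Steps 3-5: 1500 optimizer iterations with 8-step gradient accumulation
# (effective batch size 128), validating every 500 iterations.
model.train()
micro_step, iteration = 0, 0
while iteration < 1500:  # multiple passes over the data if needed
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss / 8  # scale loss for accumulation
        loss.backward()
        micro_step += 1
        if micro_step % 8 == 0:
            optimizer.step()
            optimizer.zero_grad()
            iteration += 1
            if iteration % 500 == 0:
                pass  # evaluate on held-out samples here (step 5)
            if iteration >= 1500:
                break
```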