---
license: apache-2.0
datasets:
- WpythonW/real-fake-voices-dataset2
- mozilla-foundation/common_voice_17_0
language:
- en
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- MIT/ast-finetuned-audioset-10-10-0.4593
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio
- audio-classification
- fake-audio-detection
- ast
widget:
- text: Upload an audio file to check if it's real or synthetic
inference:
  parameters:
    sampling_rate: 16000
    audio_channel: mono
model-index:
- name: ast-fakeaudio-detector
  results:
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: real-fake-voices-dataset2
      type: WpythonW/real-fake-voices-dataset2
    metrics:
    - type: accuracy
      value: 0.9662
    - type: f1
      value: 0.971
    - type: precision
      value: 0.9692
    - type: recall
      value: 0.9728
---

# AST Fine-tuned for Fake Audio Detection

This model is a fine-tuned version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) for detecting fake/synthetic audio. The original AST (Audio Spectrogram Transformer) classification head was replaced with a binary classification layer optimized for fake audio detection.

## Model Description

- **Base Model**: MIT/ast-finetuned-audioset-10-10-0.4593 (AST pretrained on AudioSet)
- **Task**: Binary classification (fake/real audio detection)
- **Input**: Audio converted to Mel spectrogram (128 mel bins, 1024 time frames)
- **Output**: Probabilities [fake_prob, real_prob]
- **Training Hardware**: 2x NVIDIA T4 GPUs
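
The 1024-frame input length can be related to audio duration with a little arithmetic. The sketch below assumes the usual AST front end (Kaldi-style filterbanks with a 25 ms window and a 10 ms hop at 16 kHz); these framing values are not stated in this card, and `num_frames` and the constants are illustrative names only.

```python
SAMPLE_RATE = 16_000   # Hz, the rate the model expects
WIN_MS = 25            # assumed analysis window (not stated in this card)
HOP_MS = 10            # assumed hop between frames (not stated in this card)
MAX_FRAMES = 1024      # time frames in the model input


def num_frames(duration_s: float) -> int:
    """Number of spectrogram frames before padding or truncation."""
    n_samples = int(round(duration_s * SAMPLE_RATE))
    win = SAMPLE_RATE * WIN_MS // 1000   # 400 samples
    hop = SAMPLE_RATE * HOP_MS // 1000   # 160 samples
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


for seconds in (4, 10, 12):
    frames = num_frames(seconds)
    action = "truncated" if frames > MAX_FRAMES else "padded"
    print(f"{seconds:>2} s -> {frames} frames, {action} to {MAX_FRAMES}")
```

Under these assumptions, roughly 10 seconds of audio fills the input; shorter clips are padded and longer clips are cut off at 1024 frames.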

# Usage Guide

## Model Usage

```python
import torch
import torchaudio
import soundfile as sf
import numpy as np
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load model and move to available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "WpythonW/ast-fakeaudio-detector"

extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name).to(device)
model.eval()

# Process multiple audio files
audio_files = ["audio1.wav", "audio2.mp3", "audio3.ogg"]
processed_batch = []

for audio_path in audio_files:
    # Load audio file; shape is (samples,) for mono or (samples, channels)
    audio_data, sr = sf.read(audio_path)

    # Convert multi-channel audio to mono by averaging the channels
    if audio_data.ndim > 1:
        audio_data = np.mean(audio_data, axis=1)

    # Resample to 16kHz if needed
    if sr != 16000:
        # Resample expects a (channels, samples) float tensor
        waveform = torch.from_numpy(audio_data).float().unsqueeze(0)
        resample = torchaudio.transforms.Resample(
            orig_freq=sr,
            new_freq=16000
        )
        waveform = resample(waveform)
        audio_data = waveform.squeeze(0).numpy()

    processed_batch.append(audio_data)

# Prepare batch input
inputs = extractor(
    processed_batch,
    sampling_rate=16000,
    padding=True,
    return_tensors="pt"
)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Get predictions
with torch.no_grad():
    logits = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Process results: index 0 is the fake class, index 1 is the real class
for filename, probs in zip(audio_files, probabilities):
    fake_prob = float(probs[0].cpu())
    real_prob = float(probs[1].cpu())
    prediction = "FAKE" if fake_prob > real_prob else "REAL"

    print(f"\nFile: {filename}")
    print(f"Fake probability: {fake_prob:.2%}")
    print(f"Real probability: {real_prob:.2%}")
    print(f"Verdict: {prediction}")
```
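
The verdict in the usage code picks whichever class has the higher probability. In some deployments it may be preferable to flag audio only when the fake probability clears a chosen threshold. The following is a minimal sketch; `verdict` and its default threshold are hypothetical and not part of the model or the `transformers` API.

```python
def verdict(fake_prob: float, threshold: float = 0.5) -> str:
    """Return "FAKE" when fake_prob exceeds threshold, otherwise "REAL"."""
    if not 0.0 <= fake_prob <= 1.0:
        raise ValueError("fake_prob must be in [0, 1]")
    return "FAKE" if fake_prob > threshold else "REAL"


print(verdict(0.62))                 # FAKE
print(verdict(0.62, threshold=0.8))  # REAL
```

Because the two probabilities sum to 1, the default threshold of 0.5 matches the `fake_prob > real_prob` rule; raising it trades missed fakes for fewer false alarms.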

## Limitations

Important considerations when using this model:
1. The model expects 16 kHz mono audio; resample and downmix other inputs before inference.
2. Performance may degrade on types of audio manipulation that were not present in the training data.
3. The model was trained on audio samples ranging from 4 to 10 seconds in duration.
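
Because the training clips were 4 to 10 seconds long, one possible way to handle longer recordings is to split them into windows in that range and classify each window separately. The helper below is a sketch of that idea, not a method described by this card; `split_into_windows` and its defaults are hypothetical.

```python
def split_into_windows(samples, sample_rate=16_000, window_s=10.0, min_s=4.0):
    """Split a 1-D sample sequence into consecutive windows of at most window_s seconds.

    A trailing window shorter than min_s is dropped, unless it is the only
    window, so short clips are still returned unchanged.
    """
    window = int(window_s * sample_rate)
    min_len = int(min_s * sample_rate)
    chunks = [samples[i:i + window] for i in range(0, len(samples), window)]
    if len(chunks) > 1 and len(chunks[-1]) < min_len:
        chunks.pop()
    return chunks


# 23 s of silence: two 10 s windows are kept, the 3 s remainder is dropped
audio = [0.0] * (23 * 16_000)
print([len(c) / 16_000 for c in split_into_windows(audio)])
```

The helper only relies on `len` and slicing, so it accepts NumPy arrays as well as lists. How to combine the per-window predictions (for example, averaging the fake probabilities) is a design choice left to the application.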