Speech Emotion Recognition By Fine-Tuning Wav2Vec 2.0
This model is a fine-tuned version of jonatasgrosman/wav2vec2-large-xlsr-53-english for a Speech Emotion Recognition (SER) task.
Several datasets were used to fine-tune the original model:
- Surrey Audio-Visual Expressed Emotion (SAVEE) - 480 audio files from 4 male actors
- Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) - 1440 audio files from 24 professional actors (12 female, 12 male)
- Toronto emotional speech set (TESS) - 2800 audio files from 2 female actors
Seven emotions were used as classification labels:
```python
emotions = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
```
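The three corpora annotate emotions differently (RAVDESS, for example, encodes the emotion as a numeric field in the filename and includes an extra "calm" category), so their annotations have to be mapped onto this shared label set. The sketch below is a hypothetical illustration for RAVDESS only; the card does not document the exact mapping, and dropping "calm" is an assumption.
```python
# Hypothetical illustration (not from the model card): deriving one of the
# seven target labels from a RAVDESS filename such as "03-01-05-01-02-01-12.wav".
# RAVDESS stores the emotion code in the third dash-separated field; its extra
# "calm" category (code "02") has no counterpart among the seven labels, so it
# is skipped here -- an assumption, since the card does not say how it was handled.
RAVDESS_EMOTION_CODES = {
    "01": "neutral", "03": "happy", "04": "sad", "05": "angry",
    "06": "fear", "07": "disgust", "08": "surprise",
}

def ravdess_label(filename):
    code = filename.split("-")[2]
    return RAVDESS_EMOTION_CODES.get(code)  # None for "calm" files

print(ravdess_label("03-01-05-01-02-01-12.wav"))  # -> "angry"
```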
It achieves the following results on the evaluation set:
- Loss: 0.104075
- Accuracy: 0.97463
Model Usage
```bash
pip install transformers librosa torch
```
```python
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC
import librosa
import torch

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
model = Wav2Vec2ForCTC.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")

def predict_emotion(audio_path):
    # Load the audio and resample it to the 16 kHz rate the model expects
    audio, rate = librosa.load(audio_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(inputs.input_values)
    # Average the per-frame logits over the sequence length, then softmax
    predictions = torch.nn.functional.softmax(outputs.logits.mean(dim=1), dim=-1)
    predicted_label = torch.argmax(predictions, dim=-1)
    emotion = model.config.id2label[predicted_label.item()]
    return emotion

emotion = predict_emotion("example_audio.wav")
print(f"Predicted emotion: {emotion}")
# >> Predicted emotion: angry
```
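Because the prediction is just an argmax over the time-averaged logits, it can be useful to inspect the full distribution when two emotions score similarly. The following variant is a minimal sketch built on the same `feature_extractor` and `model` as above; `predict_emotion_probs` is an illustrative name, not part of the card.
```python
def predict_emotion_probs(audio_path):
    # Same preprocessing as predict_emotion above
    audio, rate = librosa.load(audio_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits.mean(dim=1)
    probs = torch.nn.functional.softmax(logits, dim=-1).squeeze(0)
    # Map each class index to its label name and probability
    return {model.config.id2label[i]: p.item() for i, p in enumerate(probs)}
```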
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 4
- eval_steps: 500
- seed: 42
- gradient_accumulation_steps: 2
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- num_epochs: 4
- max_steps: 7500
- save_steps: 1500
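For reference, here is a minimal sketch of how these settings might map onto transformers.TrainingArguments, assuming the standard Hugging Face Trainer was used (the card does not show the training script, and output_dir plus the evaluation strategy are assumptions). Adam's betas and epsilon match the TrainingArguments defaults.
```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the hyperparameters listed above
training_args = TrainingArguments(
    output_dir="wav2vec-english-speech-emotion-recognition",  # assumed name
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=4,
    max_steps=7500,               # takes precedence over num_train_epochs when set
    evaluation_strategy="steps",  # assumed, given the eval_steps setting
    eval_steps=500,
    save_steps=1500,
    seed=42,
    # optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-8 are the defaults
)
```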
Training results
| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 500  | 1.8124 | 1.365212 | 0.486258 |
| 1000 | 0.8872 | 0.773145 | 0.79704  |
| 1500 | 0.7035 | 0.574954 | 0.852008 |
| 2000 | 0.6879 | 1.286738 | 0.775899 |
| 2500 | 0.6498 | 0.697455 | 0.832981 |
| 3000 | 0.5696 | 0.33724  | 0.892178 |
| 3500 | 0.4218 | 0.307072 | 0.911205 |
| 4000 | 0.3088 | 0.374443 | 0.930233 |
| 4500 | 0.2688 | 0.260444 | 0.936575 |
| 5000 | 0.2973 | 0.302985 | 0.92389  |
| 5500 | 0.1765 | 0.165439 | 0.961945 |
| 6000 | 0.1475 | 0.170199 | 0.961945 |
| 6500 | 0.1274 | 0.15531  | 0.966173 |
| 7000 | 0.0699 | 0.103882 | 0.976744 |
| 7500 | 0.083  | 0.104075 | 0.97463  |