---
license: apache-2.0
tags:
- generated_from_trainer
metrics:
- accuracy
model_index:
  name: wav2vec-english-speech-emotion-recognition
---
# Speech Emotion Recognition By Fine-Tuning Wav2Vec 2.0
The model is a fine-tuned version of [jonatasgrosman/wav2vec2-large-xlsr-53-english](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english) for a Speech Emotion Recognition (SER) task.

Several datasets were used to fine-tune the original model:
- Surrey Audio-Visual Expressed Emotion [(SAVEE)](http://kahlan.eps.surrey.ac.uk/savee/Database.html) - 480 audio files from 4 male actors
- Ryerson Audio-Visual Database of Emotional Speech and Song [(RAVDESS)](https://zenodo.org/record/1188976) - 1440 audio files from 24 professional actors (12 female, 12 male)
- Toronto emotional speech set [(TESS)](https://tspace.library.utoronto.ca/handle/1807/24487) - 2800 audio files from 2 female actors
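
Across the three corpora this amounts to 4,720 labeled audio files in total.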

Seven emotions were used as classification labels:
```python
emotions = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
```
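The integer-to-label mapping lives in the model config and can be checked directly against the published checkpoint (a minimal sketch; the printed order comes from the checkpoint, not from the list above):
```python
from transformers import AutoConfig

# Inspect the label mapping stored in the published checkpoint.
config = AutoConfig.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
print(config.id2label)  # e.g. {0: 'angry', 1: 'disgust', ...}
```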
It achieves the following results on the evaluation set:
- Loss: 0.104075
- Accuracy: 0.97463

## Model Usage
```bash
pip install transformers librosa torch
```
```python
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
model = Wav2Vec2ForCTC.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")

def predict_emotion(audio_path):
    # Load the audio at 16 kHz, the sampling rate the model expects
    audio, rate = librosa.load(audio_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(inputs.input_values)
        # Average the logits over the time axis, then softmax to get class probabilities
        predictions = torch.nn.functional.softmax(outputs.logits.mean(dim=1), dim=-1)
        predicted_label = torch.argmax(predictions, dim=-1)
        emotion = model.config.id2label[predicted_label.item()]
    return emotion

emotion = predict_emotion("example_audio.wav")
print(f"Predicted emotion: {emotion}")
# Predicted emotion: angry
```
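The helper averages the frame-level logits over time, so each clip gets a single label. To label several files at once you can reuse `predict_emotion` in a plain loop (hypothetical example; `audio_clips/` is a placeholder directory):
```python
import glob

# Classify every WAV file in a placeholder directory with the helper defined above.
for path in sorted(glob.glob("audio_clips/*.wav")):
    print(f"{path}: {predict_emotion(path)}")
```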


## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 4
- eval_steps: 500
- seed: 42
- gradient_accumulation_steps: 2
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- num_epochs: 4
- max_steps: 7500
- save_steps: 1500
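
This is not the original training script, just a minimal sketch of how the values above could map onto `transformers.TrainingArguments`; the `output_dir`, the steps-based evaluation strategy, and the per-device reading of the batch sizes are assumptions:
```python
from transformers import TrainingArguments

# Hedged sketch mapping the listed hyperparameters onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="./wav2vec-ser",      # placeholder, not from the model card
    learning_rate=1e-4,
    per_device_train_batch_size=4,   # assumed to be per device
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=4,
    max_steps=7500,                  # max_steps takes precedence over num_train_epochs
    evaluation_strategy="steps",     # assumed, since eval_steps is specified
    eval_steps=500,
    save_steps=1500,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```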

### Training results
| Step | Training Loss | Validation Loss | Accuracy |
| ---- | ------------- | --------------- | -------- |
| 500  | 1.8124        | 1.365212        | 0.486258 |
| 1000 | 0.8872        | 0.773145        | 0.79704  |
| 1500 | 0.7035        | 0.574954        | 0.852008 |
| 2000 | 0.6879        | 1.286738        | 0.775899 |
| 2500 | 0.6498        | 0.697455        | 0.832981 |
| 3000 | 0.5696        | 0.33724         | 0.892178 |
| 3500 | 0.4218        | 0.307072        | 0.911205 |
| 4000 | 0.3088        | 0.374443        | 0.930233 |
| 4500 | 0.2688        | 0.260444        | 0.936575 |
| 5000 | 0.2973        | 0.302985        | 0.92389  |
| 5500 | 0.1765        | 0.165439        | 0.961945 |
| 6000 | 0.1475        | 0.170199        | 0.961945 |
| 6500 | 0.1274        | 0.15531         | 0.966173 |
| 7000 | 0.0699        | 0.103882        | 0.976744 |
| 7500 | 0.083         | 0.104075        | 0.97463  |
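
The Accuracy column is presumably plain classification accuracy on the evaluation split. The exact metric code is not part of this card; a minimal `compute_metrics` sketch that mirrors the time-averaging used in the usage example (and assumes the Hugging Face `evaluate` package) could look like this:
```python
import numpy as np
import evaluate  # assumption: accuracy computed with the `evaluate` package

accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Average frame-level logits over the time axis (as in the usage example),
    # then take the most likely emotion per clip.
    predictions = np.argmax(logits.mean(axis=1), axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)
```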