---
datasets:
- narad/ravdess
language:
- en
metrics:
- f1
- accuracy
- recall
- precision
pipeline_tag: audio-classification
---

# Emotion Recognition in English Using RAVDESS and Wav2Vec 2.0

<!-- Provide a quick summary of what the model is/does. -->

This model classifies the emotion expressed in an audio recording of speech. It was trained on RAVDESS, a dataset containing English audio recordings, and recognises six emotions: anger, disgust, fear, happiness, sadness and surprise.

The model reproduces the approach of this [Greek emotion recognition model](https://huggingface.co/m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition/blob/main/README.md), using a pre-trained English [Wav2Vec2](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english) model as its starting point.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Adapted from:** [Emotion Recognition in Greek](https://huggingface.co/m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition/blob/main/README.md)
- **Model type:** Neural network (NN) with CTC
- **Language(s) (NLP):** English
- **Finetuned from model:** [wav2vec2](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english)

## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]

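No official snippet is provided here. The following is a minimal sketch, assuming the checkpoint can be loaded by the standard `transformers` audio-classification pipeline (which matches the `pipeline_tag` in the metadata). `MODEL_ID` is a hypothetical placeholder, and `classify_emotion` and `top_emotion` are illustrative helper names. If the checkpoint requires a custom model class, this sketch will not load it.

```python
from typing import Dict, List

# Hypothetical placeholder: replace with this model's actual Hub repository id.
MODEL_ID = "<user>/<model-name>"


def classify_emotion(audio_path: str, model_id: str = MODEL_ID) -> List[Dict]:
    """Return label/score predictions for one audio file, highest score first."""
    # Imported lazily so top_emotion() stays usable without transformers installed.
    from transformers import pipeline

    classifier = pipeline("audio-classification", model=model_id)
    # top_k=6 requests a score for each of the six emotions listed on the card.
    return classifier(audio_path, top_k=6)


def top_emotion(predictions: List[Dict]) -> str:
    """Pick the label with the highest score from pipeline-style output."""
    return max(predictions, key=lambda p: p["score"])["label"]
```

A call such as `top_emotion(classify_emotion("speech.wav"))` would then return one emotion label, where `speech.wav` is a hypothetical local file. Decoding audio from a file path in the pipeline relies on ffmpeg being installed.
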
## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The RAVDESS dataset was split into training, validation and test sets in a 60/20/20 ratio (60% training, 20% validation, 20% test).

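A 60/20/20 split of this kind can be sketched in plain Python as below. The card does not say whether the split was stratified by emotion or which random seed was used, so both are assumptions made for illustration, and `stratified_split` is a hypothetical helper name.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable],
    fractions: Tuple[float, float, float] = (0.6, 0.2, 0.2),
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Split sample indices into train/validation/test sets, label by label.

    Stratifying by label and fixing the seed are assumptions for illustration.
    """
    # Group sample indices by their emotion label.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        splits["train"] += indices[:n_train]
        splits["validation"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]  # remainder goes to test
    return splits
```

Splitting per label keeps the six emotions in roughly the same proportions in all three sets, which makes per-emotion test metrics easier to compare.
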
### Training Procedure

The fine-tuning process varied four hyper-parameters:

- batch size (4, 8),
- gradient accumulation steps (GAS) (2, 4, 6, 8),
- number of epochs (10, 20) and
- learning rate (1e-3, 1e-4, 1e-5).

Each experiment was repeated 10 times.

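To give a sense of the size of this search, the sketch below enumerates the listed values as a full grid. It assumes that every combination was tried (the card does not say so explicitly) and treats the first hyper-parameter as the per-step batch size. The dictionary keys are illustrative names.

```python
from itertools import product

# Values listed on the card; the keys are illustrative names.
search_space = {
    "batch_size": (4, 8),
    "gradient_accumulation_steps": (2, 4, 6, 8),
    "num_epochs": (10, 20),
    "learning_rate": (1e-3, 1e-4, 1e-5),
}
REPEATS = 10  # each experiment was repeated 10 times

# One dict per combination of values (assumes a full grid search).
configs = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

total_runs = len(configs) * REPEATS
print(len(configs), total_runs)  # 48 480
```

Under the full-grid assumption, the search covers 48 configurations and 480 fine-tuning runs in total.
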
## Evaluation

The best-performing hyper-parameter set was: batch size 4, 4 GAS, 10 epochs and a learning rate of 1e-4.

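For readers who want to reuse this setting, the sketch below writes it with the parameter names of `transformers.TrainingArguments`. That the authors trained with the `Trainer` API is an assumption, as is the single-device setup used to compute the effective batch size.

```python
# Best setting reported on the card, keyed by transformers.TrainingArguments
# parameter names (use of the Trainer API is an assumption, not stated on the card).
best_hyperparameters = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 10,
    "learning_rate": 1e-4,
}

# Samples contributing to each optimizer step, assuming a single device.
effective_batch_size = (
    best_hyperparameters["per_device_train_batch_size"]
    * best_hyperparameters["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 16
```

The dictionary could be unpacked as `TrainingArguments(output_dir="out", **best_hyperparameters)`, where `"out"` is a hypothetical output directory.
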
## Testing

The model was retrained on the combined training and validation sets using the best hyper-parameter set. On the test set, accuracy and F1 score both averaged 84.84% (SD 2 and 2.08, respectively).

## Results

We retained the model with the highest performance over the 10 runs. All values are percentages.

| Emotion   | Accuracy | Precision | Recall |    F1 |
|-----------|:--------:|----------:|-------:|------:|
| Anger     |          |     96.55 |  87.50 |       |
| Disgust   |          |     90.91 |  93.75 |       |
| Fear      |          |     96.30 |  81.25 |       |
| Happiness |          |     93.10 |  84.38 |       |
| Sadness   |          |     81.58 |  96.88 |       |
| Surprise  |          |     77.78 |  87.50 |       |
| Total     |  88.54   |     89.37 |  88.54 | 88.62 |

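The per-emotion F1 cells are empty in the table, but F1 is the harmonic mean of precision and recall, so it can be recomputed from the two filled columns. The sketch below does that and also checks the Total row against unweighted (macro) averages over the six emotions. Reading the Total row as macro averages is an inference from the numbers, not something the card states.

```python
# Per-emotion precision and recall (%) copied from the results table.
precision_recall = {
    "Anger": (96.55, 87.50),
    "Disgust": (90.91, 93.75),
    "Fear": (96.30, 81.25),
    "Happiness": (93.10, 84.38),
    "Sadness": (81.58, 96.88),
    "Surprise": (77.78, 87.50),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {
    emotion: f1_score(p, r) for emotion, (p, r) in precision_recall.items()
}

# Unweighted (macro) averages over the six emotions.
n = len(precision_recall)
macro_precision = sum(p for p, _ in precision_recall.values()) / n
macro_recall = sum(r for _, r in precision_recall.values()) / n
macro_f1 = sum(per_class_f1.values()) / n

print(round(macro_precision, 2), round(macro_recall, 2), round(macro_f1, 2))
# 89.37 88.54 88.62
```

The three macro averages reproduce the Total row, which supports the macro-average reading. The recomputed per-emotion F1 values are derived figures and may differ slightly from the authors' own, since the inputs are already rounded.
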
<!-- ## Citation [optional] -->

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

<!-- **BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed] -->