---
base_model:
- openai/whisper-large-v3
language:
- en
license: openrail
metrics:
- f1
pipeline_tag: audio-classification
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- speech_emotion_recognition
library_name: transformers
---

# Whisper-Large V3 for Categorical Emotion Classification

# Model Description
This model implements the categorical emotion classifier described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).

The training pipeline is also the one used for SAILER, the top-performing solution in the INTERSPEECH 2025 Speech Emotion Challenge (https://lab-msp.com/MSP-Podcast_Competition/IS2025/). 
Unlike our official challenge submission, this release does not use the full set of augmentations or the transcripts: it is a speech-only system, kept simple while remaining effective.

We train this model on the MSP-Podcast data. Note that the model may be sensitive to content information when making emotion predictions. This sensitivity can, however, be useful when classifying emotions in online content.


The included emotions are: 
<pre>
[
    'Anger', 
    'Contempt', 
    'Disgust', 
    'Fear', 
    'Happiness', 
    'Neutral', 
    'Sadness', 
    'Surprise', 
    'Other'
]
</pre>
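
The order of this list matters: the prediction code later in this card decodes the model output by indexing the list with the position of the highest score. As a minimal sketch in plain Python, the snippet below turns nine raw scores into ranked (label, probability) pairs. The helper name `rank_emotions` and the example scores are hypothetical and for illustration only; it assumes the scores have already been converted to a Python list (for example with `logits[0].tolist()`).

```python
import math

# Same order as the label list above; index i corresponds to score i.
EMOTION_LABELS = [
    "Anger", "Contempt", "Disgust", "Fear", "Happiness",
    "Neutral", "Sadness", "Surprise", "Other",
]


def rank_emotions(scores, labels=EMOTION_LABELS):
    """Return (label, probability) pairs sorted from most to least likely."""
    if len(scores) != len(labels):
        raise ValueError("expected one score per label")
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores, for illustration only (not real model output).
example_scores = [0.1, -1.2, -0.5, -2.0, 2.3, 1.1, 0.0, -0.7, -1.5]
ranked = rank_emotions(example_scores)
print(ranked[0][0])  # label with the highest probability
```

Looking at the full ranking rather than only the top label can help when two emotions receive similar probabilities.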

- Library: https://github.com/tiantiaf0627/vox-profile-release

# How to use this model

## Download repo
```
git clone https://github.com/tiantiaf0627/vox-profile-release.git
```
## Install the package
```
conda create -n vox_profile python=3.8
conda activate vox_profile
cd vox-profile-release
pip install -e .
```

## Load the model
```python
# Load libraries
import torch
import torch.nn.functional as F
from src.model.emotion.whisper_emotion import WhisperWrapper
# Find device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load model from Huggingface
model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-msp-podcast-emotion").to(device)
model.eval()
```

## Prediction
```python
# Label List
emotion_label_list = [
    'Anger', 
    'Contempt', 
    'Disgust', 
    'Fear', 
    'Happiness', 
    'Neutral', 
    'Sadness', 
    'Surprise', 
    'Other'
]
    
# Load data; zeros are used here as a placeholder
# Training excluded audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limitation),
# so prepare your audio as 16 kHz, mono channel, and at most 15 seconds long
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
logits, embedding, _, _, _, _ = model(
    data, return_feature=True
)
    
# Probability and output
emotion_prob = F.softmax(logits, dim=1)
print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
```
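
The comments above say that input should be 16 kHz mono audio of at most 15 seconds, and that clips under 3 seconds gave unreliable predictions. For longer recordings, one possible approach is to split the waveform into consecutive windows and classify each window separately. The sketch below only computes the sample ranges. The helper name `chunk_bounds` and the non-overlapping windowing are assumptions made for illustration, not part of the Vox-Profile library.

```python
SAMPLE_RATE = 16000  # the model expects 16 kHz mono audio
MIN_SECONDS = 3      # shorter clips were filtered out of the training data
MAX_SECONDS = 15     # longer clips were filtered out of the training data


def chunk_bounds(num_samples, sample_rate=SAMPLE_RATE,
                 min_seconds=MIN_SECONDS, max_seconds=MAX_SECONDS):
    """Split a recording into consecutive (start, end) sample ranges.

    Each range is at most max_seconds long; a trailing range shorter
    than min_seconds is dropped.
    """
    max_len = max_seconds * sample_rate
    min_len = min_seconds * sample_rate
    bounds = []
    for start in range(0, num_samples, max_len):
        end = min(start + max_len, num_samples)
        if end - start >= min_len:
            bounds.append((start, end))
    return bounds


# Example: a 40-second recording sampled at 16 kHz.
bounds = chunk_bounds(40 * SAMPLE_RATE)
for start, end in bounds:
    print(start, end)
```

Each range can then be sliced from a `[1, num_samples]` tensor as `data[:, start:end]` and passed to the model as in the prediction snippet. How to combine per-window predictions (for example, averaging probabilities) is left open here.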

## If you have any questions, please contact: Tiantian Feng ([email protected])

## Kindly cite our paper if you are using our model or find it useful in your work
```
@article{feng2025vox,
  title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
  author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
}
```

Responsible use of the model: the model is released under the Open RAIL license. Users should respect the privacy and consent of the data subjects and adhere to the relevant laws and regulations in their jurisdictions when using the model.

❌ **Out-of-Scope Use**
- Clinical or diagnostic applications
- Surveillance
- Privacy-invasive applications
- Commercial use