---
base_model:
- openai/whisper-large-v3
language:
- en
license: openrail
metrics:
- f1
pipeline_tag: audio-classification
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- speech_emotion_recognition
library_name: transformers
---
# Whisper-Large V3 for Categorical Emotion Classification
# Model Description
This model implements the categorical emotion classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).
The training pipeline is also the top-performing solution (SAILER) in the INTERSPEECH 2025 Speech Emotion Challenge (https://lab-msp.com/MSP-Podcast_Competition/IS2025/).
Compared to our official challenge submission, this model does not use all of the augmentations and does not use transcripts; it is a speech-only system that keeps the model simple while remaining effective.
We trained this model on the MSP-Podcast data. Note that the model may be sensitive to content information when making emotion predictions; however, this can be a useful property for classifying emotions in online content.
The included emotions are:
```
[
    'Anger',
    'Contempt',
    'Disgust',
    'Fear',
    'Happiness',
    'Neutral',
    'Sadness',
    'Surprise',
    'Other'
]
```
- Library: https://github.com/tiantiaf0627/vox-profile-release
# How to use this model
## Download repo
```
git clone [email protected]:tiantiaf0627/vox-profile-release.git
```
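If SSH access to GitHub is not set up, cloning over HTTPS works as well:
```
git clone https://github.com/tiantiaf0627/vox-profile-release.git
```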
## Install the package
```
conda create -n vox_profile python=3.8
conda activate vox_profile
cd vox-profile-release
pip install -e .
```
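To verify the installation, a quick import check can be run from the repo root (this uses the same module path as the loading snippet below):
```
python -c "from src.model.emotion.whisper_emotion import WhisperWrapper"
```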
## Load the model
```python
# Load libraries
import torch
import torch.nn.functional as F
from src.model.emotion.whisper_emotion import WhisperWrapper
# Select device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the model from Hugging Face
model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-msp-podcast-emotion").to(device)
model.eval()
```
## Prediction
```python
# Label List
emotion_label_list = [
'Anger',
'Contempt',
'Disgust',
'Fear',
'Happiness',
'Neutral',
'Sadness',
'Surprise',
'Other'
]
# Load data; here we use zeros as a placeholder example
# Our training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limitations)
# So prepare your audio as mono-channel, 16kHz waveforms of at most 15 seconds
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
logits, embedding, _, _, _, _ = model(
data, return_feature=True
)
# Probability and output
emotion_prob = F.softmax(logits, dim=1)
print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
```
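To run the model on a real recording instead of the zero tensor above, the sketch below loads, downmixes, resamples, and truncates an audio file following the mono, 16kHz, 15-second constraints described above. It assumes torchaudio is installed; `"your_audio.wav"` is a placeholder path, and `model`, `device`, `max_audio_length`, and `emotion_label_list` come from the snippets above.
```python
import torch
import torchaudio
import torch.nn.functional as F

# Load and prepare the audio (placeholder path)
waveform, sr = torchaudio.load("your_audio.wav")  # shape: [channels, samples]
waveform = waveform.mean(dim=0, keepdim=True)     # downmix to mono
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
data = waveform[:, :max_audio_length].float().to(device)

# Run inference without tracking gradients and print the full distribution
with torch.no_grad():
    logits, embedding, _, _, _, _ = model(data, return_feature=True)
emotion_prob = F.softmax(logits, dim=1).squeeze(0)
for label, prob in zip(emotion_label_list, emotion_prob.tolist()):
    print(f"{label}: {prob:.3f}")
```
With `return_feature=True`, the returned `embedding` can also be kept for downstream use, e.g., as a speech representation for other classifiers.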
## If you have any questions, please contact: Tiantian Feng ([email protected])
## Kindly cite our paper if you are using our model or find it useful in your work
```
@article{feng2025vox,
  title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
  author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
}
```
Responsible use of the Model: the Model is released under an Open RAIL license. Users should respect the privacy and consent of data subjects and adhere to the relevant laws and regulations of their jurisdictions when using this model.
**Out-of-Scope Use**
- Clinical or diagnostic applications
- Surveillance
- Privacy-invasive applications
- Commercial use