# Model Card: aniemore-audio-finetuned
## Model Summary

`Aniemore/wavlm-emotion-russian-resd` fine-tuned on a realistically distributed Russian-language dataset of emotional speech for 7-class emotion classification. Training was conducted as part of the EchoStressAI project to evaluate model performance under real-world class imbalance.
## Model Details

- Model type: `WavLMForSequenceClassification`
- Pretrained base: `Aniemore/wavlm-emotion-russian-resd`
- Fine-tuned dataset: unbalanced, naturally distributed dataset (from Dusha and EmoGator)
- Languages: Russian
- Task: Speech Emotion Recognition (SER)
## Label Mapping

| ID | Label     |
|----|-----------|
| 0  | Angry     |
| 1  | Disgusted |
| 2  | Happy     |
| 3  | Neutral   |
| 4  | Sad       |
| 5  | Scared    |
| 6  | Surprised |
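For reference, a minimal inference sketch showing how these IDs map to predicted labels. The repository ID `nikatonika/aniemore-audio-finetuned`, the file name `utterance.wav`, and the mono 16 kHz input are illustrative assumptions, not confirmed by this card:

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, WavLMForSequenceClassification

# Hypothetical repository ID: substitute the actual fine-tuned checkpoint.
MODEL_ID = "nikatonika/aniemore-audio-finetuned"

feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = WavLMForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Load a (mono) utterance and resample to the 16 kHz rate WavLM expects.
waveform, sample_rate = torchaudio.load("utterance.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

inputs = feature_extractor(
    waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt"
)
with torch.no_grad():
    logits = model(**inputs).logits

# config.id2label follows the mapping table above (0 = Angry, ..., 6 = Surprised).
pred_id = int(logits.argmax(dim=-1))
print(model.config.id2label[pred_id])
```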
## Training Details

- Epochs: 8 (checkpoint selected at step 135000)
- Batch size: 4
- Learning rate: 1e-5
- Optimizer: AdamW
- Loss function: CrossEntropyLoss
- Scheduler: linear
- Monitoring: Weights & Biases (wandb)
- Mixed precision: enabled (fp16)
- Resume logic: `resume_from_checkpoint=True`
- Best model selection: `load_best_model_at_end=True` with `metric_for_best_model="f1"`
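A minimal sketch of how these settings map onto `transformers.TrainingArguments`; the output path, the step-level eval/save cadence, and the commented-out dataset wiring are illustrative assumptions, not details stated in this card:

```python
from transformers import TrainingArguments, Trainer, WavLMForSequenceClassification

# Base checkpoint and the 7 emotion classes come from the card above.
model = WavLMForSequenceClassification.from_pretrained(
    "Aniemore/wavlm-emotion-russian-resd", num_labels=7
)

training_args = TrainingArguments(
    output_dir="wavlm-emotion-finetuned",  # hypothetical path
    num_train_epochs=8,
    per_device_train_batch_size=4,
    learning_rate=1e-5,                    # AdamW is the Trainer default optimizer
    lr_scheduler_type="linear",
    fp16=True,
    evaluation_strategy="steps",           # assumed; renamed eval_strategy in recent transformers
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    report_to="wandb",
)

# Datasets and the f1-producing compute_metrics are omitted from this sketch:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_ds, eval_dataset=eval_ds,
#                   compute_metrics=compute_metrics)
# trainer.train(resume_from_checkpoint=True)  # picks up after Colab interruptions
```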
## Evaluation Results (Test Set, 5463 samples)

| Metric               | Value |
|----------------------|-------|
| Accuracy             | 0.93  |
| F1-score (macro avg) | 0.79  |
| F1-score (weighted)  | 0.93  |
| Precision (macro)    | 0.80  |
| Recall (macro)       | 0.79  |
Class-wise F1-scores:
- Angry: 0.70
- Disgusted: 0.83
- Happy: 0.73
- Neutral: 0.96
- Sad: 0.54
- Scared: 0.79
- Surprised: 0.98
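The gap between macro F1 (0.79) and weighted F1 (0.93) reflects the class imbalance: macro averaging weights every class equally, while weighted averaging is dominated by the large `Neutral` class. A minimal sketch of how such a report is computed with scikit-learn; the label values below are illustrative placeholders, not project data:

```python
from sklearn.metrics import classification_report, f1_score

# y_true / y_pred are integer label IDs (0-6) from the mapping table above;
# these values are placeholders for demonstration only.
y_true = [3, 3, 4, 6, 0, 2]
y_pred = [3, 3, 3, 6, 0, 3]

# Per-class precision/recall/F1 plus macro and weighted averages.
print(classification_report(y_true, y_pred, digits=2))
print("macro F1:   ", f1_score(y_true, y_pred, average="macro"))
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```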
Observations:

- Excellent recognition of `Neutral` and `Surprised` speech.
- Good generalization on `Disgusted` and `Scared`.
- Confusion between `Sad` and `Neutral` (51% of `Sad` predicted as `Neutral`).
- `Happy` is also confounded with `Neutral` in ~27% of cases.
- `Scared` partially overlaps with `Disgusted`.
## Training Stability and Improvements

During training, the model showed steady improvement with no signs of overfitting. Google Colab session interruptions were handled with `resume_from_checkpoint=True`.
Potential Improvements:

- Hyperparameter tuning via Optuna (`learning_rate`, `weight_decay`, `warmup_ratio`)
- Class-sensitive losses, e.g. `FocalLoss` or weighted `CrossEntropyLoss` (see the sketch after this list)
- Data augmentation (noise, pitch shift, time stretch) for robustness
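As one illustration of the class-sensitive loss idea, a minimal sketch of a weighted `CrossEntropyLoss` wired into a custom `Trainer` subclass. The class counts below are hypothetical; real weights would come from the training-set label distribution:

```python
import torch
from torch.nn import functional as F
from transformers import Trainer

# Hypothetical per-class sample counts for label IDs 0-6 (Angry ... Surprised);
# in practice these would be measured from the training set.
class_counts = torch.tensor([900.0, 400.0, 800.0, 9000.0, 600.0, 500.0, 300.0])
# Inverse-frequency weights: rare classes (e.g. Sad) get larger weight
# relative to the dominant Neutral class.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = F.cross_entropy(
            outputs.logits,
            labels,
            weight=class_weights.to(outputs.logits.device),
        )
        return (loss, outputs) if return_outputs else loss
```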
## Datasets and Licensing

The model was trained using Russian speech samples from:

- Dusha dataset (CC BY 4.0)
- EmoGator dataset (Apache 2.0)

The emotion groupings reflect established affective-science models (Plutchik's wheel, the Geneva Emotion Wheel).
## License

This model is distributed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Use, modification, and redistribution are permitted with proper attribution. The original datasets retain their respective licenses.
## Intended Use

- Target: Russian emotional speech in semi-structured conditions
- Applications:
  - Emotion-aware AI assistants
  - Psychological state monitoring
  - Multimodal interaction systems
  - Experimental SER studies
## Limitations

- Performance may degrade on noisy or dialectal input
- Emotion granularity is limited to 7 categories
- Confusions persist between semantically or acoustically close emotions
## Citation

To be added after the formal publication of EchoStressAI results.
## Contact

- Developed by: https://huggingface.co/nikatonika
- Project: EchoStressAI