# Model Card: aniemore-audio-finetuned
## Model Summary

`Aniemore/wavlm-emotion-russian-resd` fine-tuned on a realistically distributed Russian-language dataset of emotional speech for 7-class emotion classification. Training was conducted as part of the EchoStressAI project to evaluate model performance under real-world class imbalance.
## Model Details

- Model type: `WavLMForSequenceClassification`
- Pretrained base: `Aniemore/wavlm-emotion-russian-resd`
- Fine-tuned dataset: unbalanced, naturally distributed dataset (from Dusha and EmoGator)
- Languages: Russian
- Task: Speech Emotion Recognition (SER)
## Label Mapping

| ID | Label     |
|----|-----------|
| 0  | Angry     |
| 1  | Disgusted |
| 2  | Happy     |
| 3  | Neutral   |
| 4  | Sad       |
| 5  | Scared    |
| 6  | Surprised |
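For reference, a minimal inference sketch showing how these IDs map to predicted labels. The repository ID `nikatonika/aniemore-audio-finetuned`, the file name `utterance.wav`, and the mono 16 kHz input are illustrative assumptions, not confirmed by this card:

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, WavLMForSequenceClassification

# Hypothetical repository ID: substitute the actual fine-tuned checkpoint.
MODEL_ID = "nikatonika/aniemore-audio-finetuned"

feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = WavLMForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Load a (mono) utterance and resample to the 16 kHz rate WavLM expects.
waveform, sample_rate = torchaudio.load("utterance.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

inputs = feature_extractor(
    waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt"
)
with torch.no_grad():
    logits = model(**inputs).logits

# config.id2label follows the mapping table above (0 = Angry, ..., 6 = Surprised).
pred_id = int(logits.argmax(dim=-1))
print(model.config.id2label[pred_id])
```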
## Training Details

- Epochs: 8 (checkpoint selected at step 135000)
- Batch size: 4
- Learning rate: 1e-5
- Optimizer: AdamW
- Loss function: CrossEntropyLoss
- Scheduler: linear
- Monitoring: Weights & Biases (wandb)
- Mixed precision: enabled (fp16)
- Resume logic: `resume_from_checkpoint=True`
- Best model selection: `load_best_model_at_end=True` with `metric_for_best_model="f1"`
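A minimal sketch of how these settings map onto `transformers.TrainingArguments`; the output path, the step-level eval/save cadence, and the commented-out dataset wiring are illustrative assumptions, not details stated in this card:

```python
from transformers import TrainingArguments, Trainer, WavLMForSequenceClassification

# Base checkpoint and the 7 emotion classes come from the card above.
model = WavLMForSequenceClassification.from_pretrained(
    "Aniemore/wavlm-emotion-russian-resd", num_labels=7
)

training_args = TrainingArguments(
    output_dir="wavlm-emotion-finetuned",  # hypothetical path
    num_train_epochs=8,
    per_device_train_batch_size=4,
    learning_rate=1e-5,                    # AdamW is the Trainer default optimizer
    lr_scheduler_type="linear",
    fp16=True,
    evaluation_strategy="steps",           # assumed; renamed eval_strategy in recent transformers
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    report_to="wandb",
)

# Datasets and the f1-producing compute_metrics are omitted from this sketch:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_ds, eval_dataset=eval_ds,
#                   compute_metrics=compute_metrics)
# trainer.train(resume_from_checkpoint=True)  # picks up after Colab interruptions
```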
## Evaluation Results (Test Set, 5463 samples)

| Metric               | Value |
|----------------------|-------|
| Accuracy             | 0.93  |
| F1-score (macro avg) | 0.79  |
| F1-score (weighted)  | 0.93  |
| Precision (macro)    | 0.80  |
| Recall (macro)       | 0.79  |
Class-wise F1-scores:
- Angry: 0.70
- Disgusted: 0.83
- Happy: 0.73
- Neutral: 0.96
- Sad: 0.54
- Scared: 0.79
- Surprised: 0.98
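The gap between macro F1 (0.79) and weighted F1 (0.93) reflects the class imbalance: macro averaging weights every class equally, while weighted averaging is dominated by the large `Neutral` class. A minimal sketch of how such a report is computed with scikit-learn; the label values below are illustrative placeholders, not project data:

```python
from sklearn.metrics import classification_report, f1_score

# y_true / y_pred are integer label IDs (0-6) from the mapping table above;
# these values are placeholders for demonstration only.
y_true = [3, 3, 4, 6, 0, 2]
y_pred = [3, 3, 3, 6, 0, 3]

# Per-class precision/recall/F1 plus macro and weighted averages.
print(classification_report(y_true, y_pred, digits=2))
print("macro F1:   ", f1_score(y_true, y_pred, average="macro"))
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```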
Observations:

- Excellent recognition of `Neutral` and `Surprised` speech.
- Good generalization on `Disgusted` and `Scared`.
- Confusion between `Sad` and `Neutral` (51% of `Sad` predicted as `Neutral`).
- `Happy` is also confounded with `Neutral` in ~27% of cases.
- `Scared` partially overlaps with `Disgusted`.
## Training Stability and Improvements

During training, the model showed steady improvement with no signs of overfitting. Google Colab session interruptions were handled with `resume_from_checkpoint=True`.
Potential Improvements:

- Hyperparameter tuning via Optuna (`learning_rate`, `weight_decay`, `warmup_ratio`)
- Class-sensitive losses, e.g. `FocalLoss` or weighted `CrossEntropyLoss` (see the sketch after this list)
- Data augmentation (noise, pitch shift, time stretch) for robustness
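As one illustration of the class-sensitive loss idea, a minimal sketch of a weighted `CrossEntropyLoss` wired into a custom `Trainer` subclass. The class counts below are hypothetical; real weights would come from the training-set label distribution:

```python
import torch
from torch.nn import functional as F
from transformers import Trainer

# Hypothetical per-class sample counts for label IDs 0-6 (Angry ... Surprised);
# in practice these would be measured from the training set.
class_counts = torch.tensor([900.0, 400.0, 800.0, 9000.0, 600.0, 500.0, 300.0])
# Inverse-frequency weights: rare classes (e.g. Sad) get larger weight
# relative to the dominant Neutral class.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = F.cross_entropy(
            outputs.logits,
            labels,
            weight=class_weights.to(outputs.logits.device),
        )
        return (loss, outputs) if return_outputs else loss
```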
## Datasets and Licensing

The model was trained using Russian speech samples from:

- Dusha dataset (CC BY 4.0)
- EmoGator dataset (Apache 2.0)

The emotion groupings reflect established affective-science models (Plutchik's wheel, the Geneva Emotion Wheel).
## License

This model is distributed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Use, modification, and redistribution are permitted with proper attribution. The original datasets retain their respective licenses.
## Intended Use

- Target: Russian emotional speech in semi-structured conditions
- Applications:
  - Emotion-aware AI assistants
  - Psychological state monitoring
  - Multimodal interaction systems
  - Experimental SER studies
## Limitations

- Performance may degrade on noisy or dialectal input
- Emotion granularity is limited to 7 categories
- Confusions persist between semantically or acoustically close emotions
## Citation

To be added after the formal publication of EchoStressAI results.
## Contact

- Developed by: https://huggingface.co/nikatonika
- Project: EchoStressAI