
Model Card: aniemore-audio-finetuned

Model Summary

A fine-tuned version of Aniemore/wavlm-emotion-russian-resd, trained on a realistically distributed Russian-language emotional-speech dataset for 7-class emotion classification. Training was conducted as part of the EchoStressAI project to evaluate model performance under real-world class imbalance.


Model Details

  • Model type: WavLMForSequenceClassification
  • Pretrained base: Aniemore/wavlm-emotion-russian-resd
  • Fine-tuned dataset: Unbalanced natural dataset (from Dusha and EmoGator)
  • Languages: Russian
  • Task: Speech Emotion Recognition (SER)

Label Mapping

  ID  Label
  0   Angry
  1   Disgusted
  2   Happy
  3   Neutral
  4   Sad
  5   Scared
  6   Surprised
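The mapping above can be applied directly to a row of model logits. A minimal sketch in plain Python; `decode_prediction` is a hypothetical helper, not part of the model's API:

```python
# The id2label mapping from the table above.
ID2LABEL = {
    0: "Angry", 1: "Disgusted", 2: "Happy", 3: "Neutral",
    4: "Sad", 5: "Scared", 6: "Surprised",
}

def decode_prediction(logits):
    """Return the label whose logit is largest (argmax decoding)."""
    best_id = max(range(len(logits)), key=lambda i: logits[i])
    return ID2LABEL[best_id]

# Example: a logit vector peaking at index 3 decodes to "Neutral".
print(decode_prediction([-1.2, 0.1, 0.3, 4.7, -0.5, 0.0, 1.1]))  # Neutral
```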

Training Details

  • Epochs: 8 (checkpoint selected at step 135000)
  • Batch size: 4
  • Learning rate: 1e-5
  • Optimizer: AdamW
  • Loss function: CrossEntropyLoss
  • Scheduler: Linear
  • Monitoring: Weights & Biases (wandb)
  • Mixed precision: Enabled (fp16)
  • Resume logic: resume_from_checkpoint=True
  • Best model selection: load_best_model_at_end=True with metric_for_best_model="f1"
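The settings above can be sketched as a transformers TrainingArguments configuration. This is a reconstruction under assumptions: the output directory and the steps-based eval/save cadence are illustrative, and warmup/weight decay were not reported. AdamW is the Trainer's default optimizer, so it needs no explicit flag:

```python
from transformers import TrainingArguments

# Sketch of the reported configuration; "wavlm-emotion-finetune" and the
# eval/save strategies are assumptions, not values stated in this card.
training_args = TrainingArguments(
    output_dir="wavlm-emotion-finetune",
    num_train_epochs=8,
    per_device_train_batch_size=4,
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    fp16=True,                   # mixed precision, as reported
    report_to="wandb",           # Weights & Biases monitoring
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
# Resuming after a Colab disconnect is then:
# trainer.train(resume_from_checkpoint=True)
```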

Evaluation Results (Test Set, 5463 samples)

  Metric                Value
  Accuracy              0.93
  F1-score (macro avg)  0.79
  F1-score (weighted)   0.93
  Precision (macro)     0.80
  Recall (macro)        0.79

Class-wise F1-scores:

  • Angry: 0.70
  • Disgusted: 0.83
  • Happy: 0.73
  • Neutral: 0.96
  • Sad: 0.54
  • Scared: 0.79
  • Surprised: 0.98
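The macro F1 above is simply the unweighted mean of these per-class scores, which is why it sits well below the weighted F1 when minority classes such as Sad lag behind:

```python
# Per-class F1 scores from the list above.
class_f1 = {
    "Angry": 0.70, "Disgusted": 0.83, "Happy": 0.73, "Neutral": 0.96,
    "Sad": 0.54, "Scared": 0.79, "Surprised": 0.98,
}

# Macro F1 weighs every class equally, so the weak Sad class (0.54)
# pulls the average down; weighted F1 (0.93) instead tracks the
# dominant, well-recognized classes.
macro_f1 = round(sum(class_f1.values()) / len(class_f1), 2)
print(macro_f1)  # 0.79, matching the reported macro F1
```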

Observations:

  • Excellent recognition of Neutral and Surprised speech.
  • Good generalization on Disgusted and Scared.
  • Confusion between Sad and Neutral (51% of Sad predicted as Neutral).
  • Happy is also confused with Neutral in ~27% of cases.
  • Scared partially overlaps with Disgusted.

Training Stability and Improvements

During training, the model demonstrated steady improvement with no signs of overfitting. Google Colab session interruptions were handled with resume_from_checkpoint=True.

Potential Improvements:

  • Hyperparameter tuning via Optuna (learning_rate, weight_decay, warmup_ratio)
  • Class-sensitive losses (e.g., FocalLoss, weighted CrossEntropyLoss)
  • Data augmentation (noise, pitch shift, time stretch) for robustness
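One of the improvements listed above, a weighted CrossEntropyLoss, typically uses inverse-frequency class weights. A pure-Python sketch; the per-class counts below are made-up placeholders for illustration, not the real Dusha/EmoGator distribution:

```python
# Hypothetical per-class sample counts illustrating heavy imbalance
# (Neutral dominant, Surprised rare); real counts are not published here.
counts = {
    "Angry": 300, "Disgusted": 150, "Happy": 400, "Neutral": 4000,
    "Sad": 250, "Scared": 180, "Surprised": 120,
}

total = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights, normalized so that a perfectly balanced
# dataset would assign every class a weight of 1.0.
weights = {label: total / (n_classes * n) for label, n in counts.items()}

# Rare classes get weights above 1, the dominant Neutral class below 1;
# these values would be passed as the `weight` tensor of
# torch.nn.CrossEntropyLoss to penalize minority-class errors more.
print(weights["Neutral"] < 1.0 < weights["Surprised"])  # True
```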

Datasets and Licensing

The model was trained using Russian speech samples from the Dusha and EmoGator datasets.

Groupings of emotions reflect established affective-science models (Plutchik's wheel of emotions, the Geneva Emotion Wheel).


License

This model is distributed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Use, modification, and redistribution are permitted with proper attribution.
Original datasets retain their respective licenses.


Intended Use

  • Target: Russian emotional speech in semi-structured conditions
  • Applications:
    • Emotion-aware AI assistants
    • Psychological state monitoring
    • Multimodal interaction systems
    • Experimental SER studies

Limitations

  • Model performance may degrade with noise or dialectal input
  • Emotion granularity is limited to 7 categories
  • Confusions still occur for semantically or acoustically close emotions

Citation

To be added after the formal publication of EchoStressAI results.


Contact

Developed by https://huggingface.co/nikatonika
Project: EchoStressAI
