Exploring Self-Supervised Learning for Speech Emotion Recognition: Feature Analysis, Dimensional Enhancement and Emotion Classification
Access status: USyd Access
Type: Thesis
Thesis type: Doctor of Philosophy
Author/s: Pratihast, Manisha
Abstract:
Speech Emotion Recognition (SER) aims to identify emotional states from speech signals by analyzing acoustic properties that reflect affective expression. Traditional SER approaches often rely on handcrafted acoustic features such as Mel-Frequency Cepstral Coefficients (MFCCs) and prosodic descriptors, which may lack the capacity to capture context-sensitive or subtle emotional variations. Recent advancements in self-supervised learning (SSL) have enabled the development of models trained on large-scale unlabeled speech data, producing general-purpose speech embeddings that enhance emotion recognition without task-specific fine-tuning. This thesis investigates the effectiveness of SSL-derived acoustic embeddings in both dimensional and categorical SER tasks, with a particular focus on dimensional SER (DSER). The study addresses three key objectives: (1) systematically compare traditional handcrafted features with SSL embeddings across three benchmark datasets for DSER; (2) enhance temporal modeling of emotional dynamics using transformer-based encoders with a two-step sequence reduction strategy; and (3) explore strategies to improve categorical SER (CSER) by leveraging DSER outputs through integration, regression-informed mapping, and multi-task learning (MTL). Empirical results demonstrate that pre-trained SSL models such as WavLM and UniSpeech-SAT outperform traditional baselines in DSER, with the greatest improvements observed for valence, followed by dominance and arousal. Transformer-based architectures with sequence reduction further enhance valence prediction. Integrating DSER into CSER frameworks yields consistent performance gains, particularly via MTL and SSL-enhanced mappings. This work contributes to building more generalizable, flexible, and context-aware SER systems.
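The abstract's third objective maps continuous DSER outputs (valence, arousal, dominance) onto categorical emotion labels. One common realization of such a mapping, sketched below, is nearest-centroid classification in VAD space; the centroid values and label set here are illustrative assumptions, not the thesis's actual mapping:

```python
import math

# Hypothetical centroids of categorical emotions in valence/arousal/dominance
# (VAD) space, each axis scaled to [0, 1]. These values are assumptions for
# this sketch, not figures taken from the thesis.
EMOTION_CENTROIDS = {
    "happy":   (0.9, 0.7, 0.7),
    "angry":   (0.1, 0.9, 0.8),
    "sad":     (0.2, 0.2, 0.2),
    "neutral": (0.5, 0.4, 0.5),
}

def map_vad_to_category(valence: float, arousal: float, dominance: float) -> str:
    """Assign the categorical label whose centroid is nearest (Euclidean
    distance) to a dimensional (v, a, d) prediction."""
    point = (valence, arousal, dominance)
    return min(
        EMOTION_CENTROIDS,
        key=lambda label: math.dist(point, EMOTION_CENTROIDS[label]),
    )

# Example: a low-valence, high-arousal prediction lands nearest "angry".
print(map_vad_to_category(0.15, 0.85, 0.7))  # → angry
```

A regression-informed mapping as described in the abstract would refine this idea, e.g. by learning the centroids or a classifier on top of the DSER regressor's outputs rather than fixing them by hand.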
Date: 2025
Rights statement: The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School: Faculty of Engineering, School of Electrical and Information Engineering
Awarding institution: The University of Sydney