Show simple item record

Field: Value [Language]

dc.contributor.author: Pratihast, Manisha
dc.date.accessioned: 2025-07-14T04:15:58Z
dc.date.available: 2025-07-14T04:15:58Z
dc.date.issued: 2025 [en_AU]
dc.identifier.uri: https://hdl.handle.net/2123/34103
dc.description.abstract: Speech Emotion Recognition (SER) aims to identify emotional states from speech signals by analyzing acoustic properties that reflect affective expression. Traditional SER approaches often rely on handcrafted acoustic features such as Mel-Frequency Cepstral Coefficients (MFCCs) and prosodic descriptors, which may lack the capacity to capture context-sensitive or subtle emotional variations. Recent advancements in self-supervised learning (SSL) have enabled the development of models trained on large-scale unlabeled speech data, producing general-purpose speech embeddings that enhance emotion recognition without task-specific fine-tuning. This thesis investigates the effectiveness of SSL-derived acoustic embeddings in both dimensional and categorical SER tasks, with a particular focus on dimensional SER (DSER). The study addresses three key objectives: (1) systematically compare traditional handcrafted features with SSL embeddings across three benchmark datasets for DSER; (2) enhance temporal modeling of emotional dynamics using transformer-based encoders with a two-step sequence reduction strategy; and (3) explore strategies to improve categorical SER (CSER) by leveraging DSER outputs through integration, regression-informed mapping, and multi-task learning (MTL). Empirical results demonstrate that pre-trained SSL models such as WavLM and UniSpeech-SAT outperform traditional baselines in DSER, with the greatest improvements observed for valence, followed by dominance and arousal. Transformer-based architectures with sequence reduction further enhance valence prediction. Integrating DSER into CSER frameworks yields consistent performance gains, particularly via MTL and SSL-enhanced mappings. This work contributes to building more generalizable, flexible, and context-aware SER systems. [en_AU]
dc.language.iso: en [en_AU]
dc.subject: Speech Emotion Recognition [en_AU]
dc.subject: Self-Supervised Learning [en_AU]
dc.title: Exploring Self-Supervised Learning for Speech Emotion Recognition: Feature Analysis, Dimensional Enhancement and Emotion Classification [en_AU]
dc.type: Thesis
dc.type.thesis: Doctor of Philosophy [en_AU]
dc.rights.other: The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. [en_AU]
usyd.faculty: SeS faculties schools::Faculty of Engineering::School of Electrical and Information Engineering [en_AU]
usyd.degree: Doctor of Philosophy Ph.D. [en_AU]
usyd.awardinginst: The University of Sydney [en_AU]
usyd.advisor: Jin, Craig
usyd.include.pub: No [en_AU]
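The abstract's "regression-informed mapping" from dimensional SER outputs to emotion categories can be illustrated with a minimal sketch: assign a categorical label by finding the nearest reference point in valence-arousal-dominance (VAD) space. The centroid values and category set below are hypothetical illustrative choices, not taken from the thesis, which may use a learned or dataset-derived mapping instead.

```python
import math

# Hypothetical VAD (valence, arousal, dominance) centroids for four
# emotion categories, scaled to [0, 1]. Illustrative values only;
# the thesis may derive its mapping from data rather than fixed points.
CENTROIDS = {
    "happy":   (0.9, 0.7, 0.6),
    "angry":   (0.2, 0.8, 0.7),
    "sad":     (0.2, 0.3, 0.3),
    "neutral": (0.5, 0.5, 0.5),
}

def map_vad_to_category(vad):
    """Map a predicted (valence, arousal, dominance) triple to the
    category whose centroid is nearest in Euclidean distance."""
    return min(CENTROIDS, key=lambda label: math.dist(vad, CENTROIDS[label]))

# Example: a high-valence, high-arousal prediction maps to "happy".
print(map_vad_to_category((0.85, 0.75, 0.6)))
```

A learned mapping (e.g. a small classifier trained on DSER outputs, or the multi-task setup the abstract mentions) would replace the fixed centroids, but the nearest-centroid view captures why improved valence prediction translates directly into better categorical accuracy.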

