Show simple item record

Field: Value [Language]

dc.contributor.author: Pratihast, Manisha
dc.date.accessioned: 2025-07-14T04:15:58Z
dc.date.available: 2025-07-14T04:15:58Z
dc.date.issued: 2025 [en_AU]
dc.identifier.uri: https://hdl.handle.net/2123/34103
dc.description.abstract: Speech Emotion Recognition (SER) aims to identify emotional states from speech signals by analyzing acoustic properties that reflect affective expression. Traditional SER approaches often rely on handcrafted acoustic features such as Mel-Frequency Cepstral Coefficients (MFCCs) and prosodic descriptors, which may lack the capacity to capture context-sensitive or subtle emotional variations. Recent advancements in self-supervised learning (SSL) have enabled the development of models trained on large-scale unlabeled speech data, producing general-purpose speech embeddings that enhance emotion recognition without task-specific fine-tuning. This thesis investigates the effectiveness of SSL-derived acoustic embeddings in both dimensional and categorical SER tasks, with a particular focus on dimensional SER (DSER). The study addresses three key objectives: (1) systematically compare traditional handcrafted features with SSL embeddings across three benchmark datasets for DSER; (2) enhance temporal modeling of emotional dynamics using transformer-based encoders with a two-step sequence reduction strategy; and (3) explore strategies to improve categorical SER (CSER) by leveraging DSER outputs through integration, regression-informed mapping, and multi-task learning (MTL). Empirical results demonstrate that pre-trained SSL models such as WavLM and UniSpeech-SAT outperform traditional baselines in DSER, with the greatest improvements observed for valence, followed by dominance and arousal. Transformer-based architectures with sequence reduction further enhance valence prediction. Integrating DSER into CSER frameworks yields consistent performance gains, particularly via MTL and SSL-enhanced mappings. This work contributes to building more generalizable, flexible, and context-aware SER systems. [en_AU]
dc.language.iso: en [en_AU]
dc.subject: Speech Emotion Recognition [en_AU]
dc.subject: Self-Supervised Learning [en_AU]
dc.title: Exploring Self-Supervised Learning for Speech Emotion Recognition: Feature Analysis, Dimensional Enhancement and Emotion Classification [en_AU]
dc.type: Thesis
dc.type.thesis: Doctor of Philosophy [en_AU]
dc.rights.other: The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. [en_AU]
usyd.faculty: SeS faculties schools::Faculty of Engineering::School of Electrical and Information Engineering [en_AU]
usyd.degree: Doctor of Philosophy Ph.D. [en_AU]
usyd.awardinginst: The University of Sydney [en_AU]
usyd.advisor: Jin, Craig
usyd.include.pub: No [en_AU]
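The abstract's "regression-informed mapping" from dimensional SER outputs to emotion categories can be illustrated with a minimal sketch: assign a categorical label by finding the nearest reference point in valence-arousal-dominance (VAD) space. The centroid values and category set below are hypothetical illustrative choices, not taken from the thesis, which may use a learned or dataset-derived mapping instead.

```python
import math

# Hypothetical VAD (valence, arousal, dominance) centroids for four
# emotion categories, scaled to [0, 1]. Illustrative values only;
# the thesis may derive its mapping from data rather than fixed points.
CENTROIDS = {
    "happy":   (0.9, 0.7, 0.6),
    "angry":   (0.2, 0.8, 0.7),
    "sad":     (0.2, 0.3, 0.3),
    "neutral": (0.5, 0.5, 0.5),
}

def map_vad_to_category(vad):
    """Map a predicted (valence, arousal, dominance) triple to the
    category whose centroid is nearest in Euclidean distance."""
    return min(CENTROIDS, key=lambda label: math.dist(vad, CENTROIDS[label]))

# Example: a high-valence, high-arousal prediction maps to "happy".
print(map_vad_to_category((0.85, 0.75, 0.6)))
```

A learned mapping (e.g. a small classifier trained on DSER outputs, or the multi-task setup the abstract mentions) would replace the fixed centroids, but the nearest-centroid view captures why improved valence prediction translates directly into better categorical accuracy.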

