Multimodality Integration for Natural Language Generation
Field | Value | Language |
--- | --- | --- |
dc.contributor.author | Wang, Eileen | |
dc.date.accessioned | 2025-01-28T06:21:44Z | |
dc.date.available | 2025-01-28T06:21:44Z | |
dc.date.issued | 2025 | en_AU |
dc.identifier.uri | https://hdl.handle.net/2123/33559 | |
dc.description.abstract | Multimodal Natural Language Processing integrates text processing with other data modalities such as image, video, audio and speech. Developing models capable of generating human-like language conditioned on multimodal data has recently become a growing research area. In this thesis, I focus on two Natural Language Generation (NLG) tasks: 1) Visual Storytelling (VST), which requires a model to produce a matching story for a given image sequence, and 2) Video Paragraph Captioning (VPC), where, given a video, a coherent textual summary describing its key events must be generated. To understand these image and video modalities, many models utilise pretrained networks to convert the raw data into an embedding stream, which is then fed into a sequence-to-sequence model to decode the text (a minimal sketch of this generic pipeline follows the record below). However, this approach fails to explicitly capture relations between important aspects, such as interactions between multimodal features or temporal relationships between key events, so how best to represent this complex multimodal data remains an open challenge. A further challenge is that models tend to lack commonsense and reasoning ability because they are exposed only to limited training data; consequently, they are unable to produce outputs that go beyond the patterns recognised in that data. To address these challenges, this research explores methods for transforming the image and video input into high-level, semantic, commonsense-enhanced graphs that promote scene and context understanding. Furthermore, because of the lack of evaluation metrics designed for VST, my thesis additionally proposes a novel reference-free metric that accounts for the subjective nature of storytelling. As the final proposed work, a multimodal commonsense knowledge graph focusing on social and event-based knowledge is introduced; it is hoped that such multimodal commonsense knowledge can serve as auxiliary information to help improve NLG tasks such as VST and VPC. | en_AU |
dc.language.iso | en | en_AU |
dc.subject | visual storytelling | en_AU |
dc.subject | video captioning | en_AU |
dc.subject | storytelling metrics | en_AU |
dc.subject | multimodal text generation | en_AU |
dc.subject | multimodal knowledge graph | en_AU |
dc.title | Multimodality Integration for Natural Language Generation | en_AU |
dc.type | Thesis | |
dc.type.thesis | Doctor of Philosophy | en_AU |
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en_AU |
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Computer Science | en_AU |
usyd.degree | Doctor of Philosophy Ph.D. | en_AU |
usyd.awardinginst | The University of Sydney | en_AU |
usyd.advisor | Poon, Josiah | |
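
The following is a minimal, illustrative sketch of the generic pipeline the abstract describes, in which features from a pretrained visual network are passed as an embedding stream to a sequence-to-sequence decoder. It is not the thesis code: the PyTorch modules, dimensions, and toy inputs are assumptions chosen only to make the idea concrete.

```python
import torch
import torch.nn as nn

class Seq2SeqCaptioner(nn.Module):
    """Baseline sketch: pooled pretrained frame features initialise a GRU text decoder."""

    def __init__(self, vocab_size, feat_dim=2048, hidden_dim=512, emb_dim=256):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden_dim)   # project pretrained frame features
        self.embed = nn.Embedding(vocab_size, emb_dim)  # token embeddings for the decoder
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)    # map hidden states to vocabulary logits

    def forward(self, frame_feats, tokens):
        # frame_feats: (batch, n_frames, feat_dim) embedding stream from a pretrained network
        # tokens:      (batch, seq_len) gold tokens for teacher-forced decoding
        enc = torch.relu(self.encode(frame_feats))         # (batch, n_frames, hidden_dim)
        h0 = enc.mean(dim=1).unsqueeze(0).contiguous()     # mean-pool frames -> (1, batch, hidden_dim)
        dec_out, _ = self.decoder(self.embed(tokens), h0)  # (batch, seq_len, hidden_dim)
        return self.out(dec_out)                           # (batch, seq_len, vocab_size)

# Toy usage: two image sequences of five frames each, reference text of seven tokens.
model = Seq2SeqCaptioner(vocab_size=1000)
frame_feats = torch.randn(2, 5, 2048)    # stand-in for pretrained CNN/video features
tokens = torch.randint(0, 1000, (2, 7))  # stand-in for tokenised reference text
logits = model(frame_feats, tokens)
print(logits.shape)                      # torch.Size([2, 7, 1000])
```

In practice the frame features would come from a pretrained image or video encoder rather than random tensors, and the decoder would be trained with teacher forcing against reference stories or paragraphs. The point of the sketch is that nothing in this baseline explicitly models interactions between modalities or temporal relations between events, which is the gap the thesis targets with commonsense-enhanced graph representations.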