Multimodality Integration for Natural Language Generation
Field | Value | Language |
--- | --- | --- |
dc.contributor.author | Wang, Eileen | |
dc.date.accessioned | 2025-01-28T06:21:44Z | |
dc.date.available | 2025-01-28T06:21:44Z | |
dc.date.issued | 2025 | en_AU |
dc.identifier.uri | https://hdl.handle.net/2123/33559 | |
dc.description.abstract | Multimodal Natural Language Processing integrates text processing with other data modalities such as image, video, audio and speech. Developing models capable of generating human-like language conditioned on multimodal data has recently become a growing research area. In this thesis, I focus on two Natural Language Generation (NLG) tasks: 1) Visual Storytelling (VST), which requires a model to produce a matching story for a given image sequence, and 2) Video Paragraph Captioning (VPC), where, given a video, a coherent textual summary describing its key events must be generated. To understand these image and video modalities, many models utilise pretrained networks to convert the raw data into an embedding stream, which is then fed into a sequence-to-sequence model to decode the text (a minimal sketch of this generic pipeline follows the record below). However, this approach fails to explicitly capture relations between important aspects, such as interactions between multimodal features or temporal relationships between key events, so how best to represent this complex multimodal data remains an open challenge. A further challenge is that models tend to lack commonsense and reasoning ability because they are exposed only to limited training data; consequently, they are unable to produce outputs that go beyond the patterns recognised in that data. To address these challenges, this research explores methods for transforming the image and video input into high-level, semantic, commonsense-enhanced graphs that promote scene and context understanding. Furthermore, because of the lack of evaluation metrics designed for VST, my thesis additionally proposes a novel reference-free metric that accounts for the subjective nature of storytelling. As the final proposed work, a multimodal commonsense knowledge graph focusing on social and event-based knowledge is introduced; it is hoped that such multimodal commonsense knowledge can serve as auxiliary information to help improve NLG tasks such as VST and VPC. | en_AU |
dc.language.iso | en | en_AU |
dc.subject | visual storytelling | en_AU |
dc.subject | video captioning | en_AU |
dc.subject | storytelling metrics | en_AU |
dc.subject | multimodal text generation | en_AU |
dc.subject | multimodal knowledge graph | en_AU |
dc.title | Multimodality Integration for Natural Language Generation | en_AU |
dc.type | Thesis | |
dc.type.thesis | Doctor of Philosophy | en_AU |
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en_AU |
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Computer Science | en_AU |
usyd.degree | Doctor of Philosophy Ph.D. | en_AU |
usyd.awardinginst | The University of Sydney | en_AU |
usyd.advisor | Poon, Josiah | |
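
The following is a minimal, illustrative sketch of the generic pipeline the abstract describes, in which features from a pretrained visual network are passed as an embedding stream to a sequence-to-sequence decoder. It is not the thesis code: the PyTorch modules, dimensions, and toy inputs are assumptions chosen only to make the idea concrete.

```python
import torch
import torch.nn as nn

class Seq2SeqCaptioner(nn.Module):
    """Baseline sketch: pooled pretrained frame features initialise a GRU text decoder."""

    def __init__(self, vocab_size, feat_dim=2048, hidden_dim=512, emb_dim=256):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden_dim)   # project pretrained frame features
        self.embed = nn.Embedding(vocab_size, emb_dim)  # token embeddings for the decoder
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)    # map hidden states to vocabulary logits

    def forward(self, frame_feats, tokens):
        # frame_feats: (batch, n_frames, feat_dim) embedding stream from a pretrained network
        # tokens:      (batch, seq_len) gold tokens for teacher-forced decoding
        enc = torch.relu(self.encode(frame_feats))         # (batch, n_frames, hidden_dim)
        h0 = enc.mean(dim=1).unsqueeze(0).contiguous()     # mean-pool frames -> (1, batch, hidden_dim)
        dec_out, _ = self.decoder(self.embed(tokens), h0)  # (batch, seq_len, hidden_dim)
        return self.out(dec_out)                           # (batch, seq_len, vocab_size)

# Toy usage: two image sequences of five frames each, reference text of seven tokens.
model = Seq2SeqCaptioner(vocab_size=1000)
frame_feats = torch.randn(2, 5, 2048)    # stand-in for pretrained CNN/video features
tokens = torch.randint(0, 1000, (2, 7))  # stand-in for tokenised reference text
logits = model(frame_feats, tokens)
print(logits.shape)                      # torch.Size([2, 7, 1000])
```

In practice the frame features would come from a pretrained image or video encoder rather than random tensors, and the decoder would be trained with teacher forcing against reference stories or paragraphs. The point of the sketch is that nothing in this baseline explicitly models interactions between modalities or temporal relations between events, which is the gap the thesis targets with commonsense-enhanced graph representations.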