Toward Multi-modal Multi-aspect Deep Alignment and Integration
Field | Value | Language |
dc.contributor.author | Long, Siqu | |
dc.date.accessioned | 2024-03-05T03:46:16Z | |
dc.date.available | 2024-03-05T03:46:16Z | |
dc.date.issued | 2024 | en_AU |
dc.identifier.uri | https://hdl.handle.net/2123/32303 | |
dc.description | Includes publication | |
dc.description.abstract | Multi-modal/-aspect data contains complementary information about the same entity of interest, which has the potential to improve model robustness and has therefore attracted increasing research attention. There are two typical categories of multi-modal/-aspect problems that require cross-modal/-aspect alignment and integration: 1) heterogeneous multi-modal problems that deal with data from multiple media forms, such as text and images, and 2) homogeneous multi-aspect problems that handle data whose different aspects are represented in the same media form, such as the syntactic and semantic aspects of a textual sentence. However, most existing approaches tackle cross-modal/-aspect alignment and integration implicitly through various deep neural networks and optimise only for the final task objective, leaving potential strategies for improving cross-modal/-aspect alignment and integration under-explored. This thesis initiates an exploration of strategies and approaches towards multi-modal/-aspect deep alignment and integration. By examining the limitations of existing approaches for both heterogeneous multi-modal problems and homogeneous multi-aspect problems, it proposes novel strategies and approaches for improving cross-modal/-aspect alignment and integration and evaluates them on the most essential representative tasks. For the heterogeneous setting, a graph-structured representation learning approach that captures cross-modal information is proposed to enforce better cross-modal alignment, and it is evaluated in the Language-to-Vision and Vision-and-Language scenarios. For the homogeneous setting, a bi-directional and deep cross-integration mechanism is explored to synthesise multi-level semantics for comprehensive text understanding, and it is validated in the joint multi-aspect natural language understanding context and its generalised text understanding setting. | en_AU |
dc.language.iso | en | en_AU |
dc.subject | text image matching | en_AU |
dc.subject | text to image generation | en_AU |
dc.subject | multi-modal learning | en_AU |
dc.subject | joint intent classification and slot filling | en_AU |
dc.subject | multi-aspect learning | en_AU |
dc.title | Toward Multi-modal Multi-aspect Deep Alignment and Integration | en_AU |
dc.type | Thesis | |
dc.type.thesis | Doctor of Philosophy | en_AU |
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en_AU |
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Computer Science | en_AU |
usyd.degree | Doctor of Philosophy Ph.D. | en_AU |
usyd.awardinginst | The University of Sydney | en_AU |
usyd.advisor | Poon, Josiah | |
usyd.include.pub | Yes | en_AU |