Toward Multi-modal Multi-aspect Deep Alignment and Integration
Field | Value | Language |
dc.contributor.author | Long, Siqu | |
dc.date.accessioned | 2024-03-05T03:46:16Z | |
dc.date.available | 2024-03-05T03:46:16Z | |
dc.date.issued | 2024 | en_AU |
dc.identifier.uri | https://hdl.handle.net/2123/32303 | |
dc.description | Includes publication | |
dc.description.abstract | Multi-modal/-aspect data contains complementary information about the same entity of interest, which has the potential to improve model robustness and has therefore attracted increasing research attention. There are two typical categories of multi-modal/-aspect problems that require cross-modal/-aspect alignment and integration: 1) heterogeneous multi-modal problems that deal with data from multiple media forms, such as text and images, and 2) homogeneous multi-aspect problems that handle data whose different aspects are represented in the same media form, such as the syntactic and semantic aspects of a textual sentence. However, most existing approaches tackle cross-modal/-aspect alignment and integration implicitly through various deep neural networks and optimise only for the final task objective, leaving potential strategies for improving cross-modal/-aspect alignment and integration under-explored. This thesis initiates an exploration of strategies and approaches towards multi-modal/-aspect deep alignment and integration. By examining the limitations of existing approaches for both heterogeneous multi-modal problems and homogeneous multi-aspect problems, it proposes novel strategies and approaches for improving cross-modal/-aspect alignment and integration and evaluates them on the most essential representative tasks. For the heterogeneous setting, a graph-structured representation learning approach that captures cross-modal information is proposed to enforce better cross-modal alignment, and it is evaluated in the Language-to-Vision and Vision-and-Language scenarios. For the homogeneous setting, a bi-directional and deep cross-integration mechanism is explored to synthesise multi-level semantics for comprehensive text understanding, and it is validated in the joint multi-aspect natural language understanding context and its generalised text understanding setting. | en_AU |
dc.language.iso | en | en_AU |
dc.subject | text image matching | en_AU |
dc.subject | text to image generation | en_AU |
dc.subject | multi-modal learning | en_AU |
dc.subject | joint intent classification and slot filling | en_AU |
dc.subject | multi-aspect learning | en_AU |
dc.title | Toward Multi-modal Multi-aspect Deep Alignment and Integration | en_AU |
dc.type | Thesis | |
dc.type.thesis | Doctor of Philosophy | en_AU |
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en_AU |
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Computer Science | en_AU |
usyd.degree | Doctor of Philosophy Ph.D. | en_AU |
usyd.awardinginst | The University of Sydney | en_AU |
usyd.advisor | Poon, Josiah | |
usyd.include.pub | Yes | en_AU |