Toward Multi-modal Multi-aspect Deep Alignment and Integration
Access status:
Open Access
Type
Thesis
Thesis type
Doctor of Philosophy
Author/s
Long, Siqu
Abstract
Multi-modal/-aspect data contains complementary information about the same object of interest, which has the potential to improve model robustness and has therefore attracted increasing research attention. There are two typical categories of multi-modal/-aspect problems that require cross-modal/-aspect alignment and integration: 1) heterogeneous multi-modal problems, which deal with data from multiple media forms, such as text and images, and 2) homogeneous multi-aspect problems, which handle data whose different aspects are represented in the same media form, such as the syntactic and semantic aspects of a sentence. However, most existing approaches tackle cross-modal/-aspect alignment and integration implicitly, through various deep neural networks optimised only against the final task objective, leaving strategies for improving the alignment and integration themselves under-explored. This thesis initiates an exploration of strategies and approaches towards multi-modal/-aspect deep alignment and integration. By examining the limitations of existing approaches to both heterogeneous multi-modal problems and homogeneous multi-aspect problems, it proposes novel strategies for improving cross-modal/-aspect alignment and integration and evaluates them on essential representative tasks. For the heterogeneous setting, a graph-structured representation learning approach that captures cross-modal information is proposed to enforce better cross-modal alignment, and is evaluated in Language-to-Vision and Vision-and-Language scenarios. For the homogeneous setting, a bi-directional, deep cross-integration mechanism is explored to synthesise multi-level semantics for comprehensive text understanding, and is validated in the joint multi-aspect natural language understanding context and its generalised text understanding setting.
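The abstract describes the heterogeneous approach only at a high level. The following is a minimal sketch of what graph-structured cross-modal alignment could look like; the PyTorch implementation, module names, dense bipartite-graph design, and dimensions are all illustrative assumptions, not the thesis's actual architecture.

```python
# Hypothetical sketch of cross-modal alignment via graph-structured
# message passing over a bipartite graph of text tokens and image
# regions. All design choices here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalGraphLayer(nn.Module):
    """One round of message passing: each destination node attends
    over all source nodes (dense bipartite edges)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
        # Scaled dot-product attention from dst nodes to src nodes.
        attn = torch.softmax(
            self.q(dst) @ self.k(src).transpose(-1, -2) / src.size(-1) ** 0.5,
            dim=-1,
        )
        # Residual update keeps each node's own modality information.
        return self.norm(dst + attn @ self.v(src))


class CrossModalAligner(nn.Module):
    """Exchanges messages in both directions, then scores alignment
    between the pooled token and region representations."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.text_to_image = CrossModalGraphLayer(dim)
        self.image_to_text = CrossModalGraphLayer(dim)

    def forward(self, tokens: torch.Tensor, regions: torch.Tensor):
        tokens_out = self.image_to_text(regions, tokens)   # regions -> tokens
        regions_out = self.text_to_image(tokens, regions)  # tokens -> regions
        score = F.cosine_similarity(
            tokens_out.mean(dim=1), regions_out.mean(dim=1), dim=-1
        )
        return tokens_out, regions_out, score


tokens = torch.randn(2, 12, 256)   # batch of 12 token embeddings
regions = torch.randn(2, 36, 256)  # batch of 36 region embeddings
_, _, score = CrossModalAligner()(tokens, regions)
print(score.shape)  # torch.Size([2])
```

An alignment score of this form could, for instance, drive a contrastive objective that pulls matched text-image pairs together, which is one common way to "enforce better cross-modal alignment".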
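Similarly, the homogeneous mechanism is only named, not specified. Below is a minimal sketch of a bi-directional, multi-layer cross-integration between two aspect streams (e.g. syntactic and semantic views of the same sentence); the two-stream design, the use of MultiheadAttention, the depth, and the fusion by concatenation are assumptions for illustration.

```python
# Hypothetical sketch of bi-directional deep cross-integration between
# two aspect streams of the same text; not the thesis's actual design.
import torch
import torch.nn as nn


class CrossIntegrationBlock(nn.Module):
    """Each stream is updated from the other via attention, so
    information flows in both directions at every depth."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.a_from_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        a2, _ = self.a_from_b(a, b, b)  # aspect A queries aspect B
        b2, _ = self.b_from_a(b, a, a)  # aspect B queries aspect A
        return self.norm_a(a + a2), self.norm_b(b + b2)


class DeepCrossIntegrator(nn.Module):
    """Stacks cross-integration blocks, then fuses both aspect views
    into one representation for downstream text understanding."""

    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            CrossIntegrationBlock(dim) for _ in range(depth)
        )

    def forward(self, syntactic: torch.Tensor, semantic: torch.Tensor):
        for block in self.blocks:
            syntactic, semantic = block(syntactic, semantic)
        return torch.cat([syntactic, semantic], dim=-1)


syn = torch.randn(2, 20, 256)  # e.g. syntax-aware token embeddings
sem = torch.randn(2, 20, 256)  # e.g. semantics-aware token embeddings
print(DeepCrossIntegrator()(syn, sem).shape)  # torch.Size([2, 20, 512])
```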
Date
2024
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering, School of Computer Science
Awarding institution
The University of Sydney