From Local to Holistic: Multimodal Learning for Image Understanding
Access status:
Open Access
Type
Thesis
Thesis type
Doctor of Philosophy
Author/s
Hao, Yichao
Abstract
Image understanding is a fundamental task in computer vision (CV) that aims to extract meaningful information from images by progressing from local feature analysis to holistic scene-level reasoning. Recent advances in deep learning have enabled significant improvements; however, challenges remain when labeled data is limited, when visual relationships are ambiguous, or when external knowledge must be integrated effectively. This thesis investigates robust and knowledge-enhanced approaches for image understanding. First, to address data scarcity and subtle local variations in medical images, we investigate the impact of imaging protocols on medical images and propose an adaptive weight estimation and segmentation method for beef composition. Radiomics methods extract predefined features from local regions of interest and are effective when labeled data is limited. We propose a radiomics-based approach for predicting prognosis from the primary tumor in colorectal cancer and, using multimodal image features, develop a survival prediction model to assess patient outcomes. Second, for holistic image understanding in natural images, we propose a method that integrates Imagen and Logogen, refined through cross-cueing during training, to mitigate inter-class similarities and intra-class variations. In addition, an adaptive knowledge bias is embedded to alleviate the long-tail distribution. Third, to improve the scalability and efficiency of knowledge integration in scene graph generation (SGG), we introduce a novel SGG model that infers visual relationships under the guidance of commonsense knowledge derived from Large Language Models (LLMs). Extensive experiments are conducted on one private dataset of PET/CT images for image analysis and on large public natural image datasets for SGG. The experiments demonstrate that the proposed methods outperform the corresponding state-of-the-art methods.
Date
2026
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering, School of Computer Science
Awarding institution
The University of Sydney