Multi-modal Understanding and Generation
Access status:
Open Access
Type
Thesis
Thesis type
Doctor of Philosophy
Author/s
Wang, Zhaoqing
Abstract
The convergence of computer vision and natural language processing has been accelerated by large-scale pre-training, enabling machines to align visual concepts with semantic meaning. However, a gap remains between the coarse capabilities of current foundation models and the demands of advanced multi-modal intelligence, especially for fine-grained dense prediction and unified generation in open-world settings. Existing approaches are constrained by the high cost of pixel-level supervision, the rigidity of closed-vocabulary training, and the separation between discriminative and generative architectures. This thesis addresses these challenges by developing frameworks that improve the alignment, scalability, and universality of multi-modal systems. First, we study Referring Image Segmentation, which aims to segment a target region from a linguistic description. We propose CRIS, a CLIP-driven framework that enforces text-to-pixel alignment through a novel vision-language decoder and contrastive learning, achieving state-of-the-art performance on complex visual entities. Second, we introduce Unpair-Seg for open-vocabulary segmentation with unpaired mask-text supervision. By using standalone images, image-text pairs, and image-mask pairs without aligned triplets, it greatly reduces annotation cost. A large vision-language model and a multi-scale matching strategy further improve mask-text alignment, narrowing the gap between weakly and fully supervised methods. Third, we present LaVin-DiT, a large vision diffusion transformer that unifies over 20 computer vision tasks within a single generative framework. By modeling visual data in a continuous latent space, it supports diverse tasks through in-context learning without task-specific fine-tuning. Overall, this thesis advances multi-modal AI by bridging image-level pre-training and fine-grained pixel-level tasks, while moving toward scalable and unified generative vision foundation models.
Date
2026
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering, School of Electrical and Information Engineering
Awarding institution
The University of Sydney