Multi-modal Understanding and Generation

Wang, Zhaoqing

Access status

Open Access

Type

Thesis

Thesis type

Doctor of Philosophy

Author/s

Wang, Zhaoqing

Abstract

The convergence of computer vision and natural language processing has been accelerated by large-scale pre-training, enabling machines to align visual concepts with semantic meaning. However, a gap remains between the coarse capabilities of current foundation models and the demands of advanced multi-modal intelligence, especially for fine-grained dense prediction and unified generation in open-world settings. Existing approaches are constrained by the high cost of pixel-level supervision, the rigidity of closed-vocabulary training, and the separation between discriminative and generative architectures. This thesis addresses these challenges by developing frameworks that improve the alignment, scalability, and universality of multi-modal systems. First, we study Referring Image Segmentation, which aims to segment a target region from a linguistic description. We propose CRIS, a CLIP-driven framework that enforces text-to-pixel alignment through a novel vision-language decoder and contrastive learning, achieving state-of-the-art performance on complex visual entities. Second, we introduce Unpair-Seg for open-vocabulary segmentation with unpaired mask-text supervision. By using standalone images, image-text pairs, and image-mask pairs without aligned triplets, it greatly reduces annotation cost. A large vision-language model and a multi-scale matching strategy further improve mask-text alignment, narrowing the gap between weakly and fully supervised methods. Third, we present LaVin-DiT, a large vision diffusion transformer that unifies over 20 computer vision tasks within a single generative framework. By modeling visual data in a continuous latent space, it supports diverse tasks through in-context learning without task-specific fine-tuning. Overall, this thesis advances multi-modal AI by bridging image-level pre-training and fine-grained pixel-level tasks, while moving toward scalable and unified generative vision foundation models.

Date

2026

Rights statement

The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.

Faculty/School

Faculty of Engineering, School of Electrical and Information Engineering

Awarding institution

The University of Sydney

Subjects

multi-modal learning
vision-language model
referring image segmentation
open-vocabulary segmentation
generative model