Multi-modal Understanding and Generation

Wang, Zhaoqing

Access status: Open Access

Field	Value	Language
dc.contributor.author	Wang, Zhaoqing
dc.date.accessioned	2026-04-02T00:04:20Z
dc.date.available	2026-04-02T00:04:20Z
dc.date.issued	2026	en
dc.identifier.uri	https://hdl.handle.net/2123/35073
dc.description.abstract	The convergence of computer vision and natural language processing has been accelerated by large-scale pre-training, enabling machines to align visual concepts with semantic meaning. However, a gap remains between the coarse capabilities of current foundation models and the demands of advanced multi-modal intelligence, especially for fine-grained dense prediction and unified generation in open-world settings. Existing approaches are constrained by the high cost of pixel-level supervision, the rigidity of closed-vocabulary training, and the separation between discriminative and generative architectures. This thesis addresses these challenges by developing frameworks that improve the alignment, scalability, and universality of multi-modal systems. First, we study Referring Image Segmentation, which aims to segment a target region from a linguistic description. We propose CRIS, a CLIP-driven framework that enforces text-to-pixel alignment through a novel vision-language decoder and contrastive learning, achieving state-of-the-art performance on complex visual entities. Second, we introduce Unpair-Seg for open-vocabulary segmentation with unpaired mask-text supervision. By using standalone images, image-text pairs, and image-mask pairs without aligned triplets, it greatly reduces annotation cost. A large vision-language model and a multi-scale matching strategy further improve mask-text alignment, narrowing the gap between weakly and fully supervised methods. Third, we present LaVin-DiT, a large vision diffusion transformer that unifies over 20 computer vision tasks within a single generative framework. By modeling visual data in a continuous latent space, it supports diverse tasks through in-context learning without task-specific fine-tuning. Overall, this thesis advances multi-modal AI by bridging image-level pre-training and fine-grained pixel-level tasks, while moving toward scalable and unified generative vision foundation models.	en
dc.language.iso	en	en
dc.subject	multi-modal learning	en
dc.subject	vision-language model	en
dc.subject	referring image segmentation	en
dc.subject	open-vocabulary segmentation	en
dc.subject	generative model	en
dc.title	Multi-modal Understanding and Generation	en
dc.type	Thesis
dc.type.thesis	Doctor of Philosophy	en
dc.rights.other	The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.	en
usyd.faculty	SeS faculties schools::Faculty of Engineering::School of Electrical and Information Engineering	en
usyd.degree	Doctor of Philosophy Ph.D.	en
usyd.awardinginst	The University of Sydney	en
usyd.advisor	Liu, Tongliang
usyd.include.pub	No	en