| Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Wang, Zhaoqing | |
| dc.date.accessioned | 2026-04-02T00:04:20Z | |
| dc.date.available | 2026-04-02T00:04:20Z | |
| dc.date.issued | 2026 | en |
| dc.identifier.uri | https://hdl.handle.net/2123/35073 | |
| dc.description.abstract | The convergence of computer vision and natural language processing has been accelerated by large-scale pre-training, enabling machines to align visual concepts with semantic meaning. However, a gap remains between the coarse capabilities of current foundation models and the demands of advanced multi-modal intelligence, especially for fine-grained dense prediction and unified generation in open-world settings. Existing approaches are constrained by the high cost of pixel-level supervision, the rigidity of closed-vocabulary training, and the separation between discriminative and generative architectures. This thesis addresses these challenges by developing frameworks that improve the alignment, scalability, and universality of multi-modal systems. First, we study Referring Image Segmentation, which aims to segment a target region from a linguistic description. We propose CRIS, a CLIP-driven framework that enforces text-to-pixel alignment through a novel vision-language decoder and contrastive learning, achieving state-of-the-art performance on complex visual entities. Second, we introduce Unpair-Seg for open-vocabulary segmentation with unpaired mask-text supervision. By using standalone images, image-text pairs, and image-mask pairs without aligned triplets, it greatly reduces annotation cost. A large vision-language model and a multi-scale matching strategy further improve mask-text alignment, narrowing the gap between weakly and fully supervised methods. Third, we present LaVin-DiT, a large vision diffusion transformer that unifies over 20 computer vision tasks within a single generative framework. By modeling visual data in a continuous latent space, it supports diverse tasks through in-context learning without task-specific fine-tuning. Overall, this thesis advances multi-modal AI by bridging image-level pre-training and fine-grained pixel-level tasks, while moving toward scalable and unified generative vision foundation models. | en |
| dc.language.iso | en | en |
| dc.subject | multi-modal learning | en |
| dc.subject | vision-language model | en |
| dc.subject | referring image segmentation | en |
| dc.subject | open-vocabulary segmentation | en |
| dc.subject | generative model | en |
| dc.title | Multi-modal Understanding and Generation | en |
| dc.type | Thesis | |
| dc.type.thesis | Doctor of Philosophy | en |
| dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en |
| usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Electrical and Information Engineering | en |
| usyd.degree | Doctor of Philosophy Ph.D. | en |
| usyd.awardinginst | The University of Sydney | en |
| usyd.advisor | Liu, Tongliang | |
| usyd.include.pub | No | en |

