Exploring Enhanced Visual Representation Learning for Improved Generative Modeling
Access status:
USyd Access
Type
Thesis
Thesis type
Doctor of Philosophy
Author/s
Yue, Xiaoyu
Abstract
Image generation is a pivotal research direction in computer vision due to its wide-ranging potential applications. Despite substantial advances brought by the development of generative paradigms and network architectures, existing models often fail to fully exploit the rich high-level semantic structure inherent in visual data. The insufficient incorporation of such semantic information limits their ability to accurately model complex real-world distributions. In this thesis, we address this limitation by integrating visual representation learning into generative frameworks, aiming to improve both the fidelity and the semantic coherence of generated images. Our research begins by investigating the intrinsic mechanisms of image generative models to verify their ability to learn high-level visual semantics. We propose a novel generative framework with a unified self-supervised training paradigm called GUNS. It employs a diffusion decoder to integrate diverse self-supervised pre-training objectives within a single denoising diffusion model. We subsequently leverage high-level visual semantics to enhance generative models, introducing three distinct methodologies for different generative paradigms and components: (1) Jointly training semantic information within the generator. (2) Injecting semantics during sampling. (3) Building a semantically aligned latent space. Extensive experiments validate the effectiveness of these three approaches and provide systematic evidence that high-quality visual semantic representations can actively enhance image generation. This research establishes a solid foundation for unifying image understanding and generation and aims to inspire future work on developing more semantically aware and controllable generative models.
Date
2025
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering, School of Electrical and Information Engineering
Awarding institution
The University of Sydney