Towards Controllable and Interpretable Latent Modeling for Vision and Beyond

Zheng, Jiyang

Access status:

Open Access

Field	Value	Language
dc.contributor.author	Zheng, Jiyang
dc.date.accessioned	2026-03-05T23:12:02Z
dc.date.available	2026-03-05T23:12:02Z
dc.date.issued	2026	en
dc.identifier.uri	https://hdl.handle.net/2123/34959
dc.description	Includes publication
dc.description.abstract	Deep learning models for vision and multimodal data rely on high-dimensional latent representations to achieve strong empirical performance. However, such representations are often difficult to interpret and control, limiting their reliability and adaptability in tasks requiring structured reasoning. This thesis studies how latent representations can be structured to enhance interpretability and controllability in visual and multimodal learning. It argues that effective latent modeling requires separating task-relevant semantics from irrelevant variability, and enforcing selective invariance and alignment during training. Rather than relying on entangled or opaque latent spaces, the thesis advocates for structured representations whose components can be explicitly preserved, modified, or aligned according to task demands. The first part of the thesis examines this principle in ordinal visual learning, where discriminative information is subtle and order-dependent. It shows that commonly used representation learning strategies induce excessive invariance, obscuring ordinal semantics in the latent space. By encouraging minimal and targeted latent variation, the proposed approach retains ordinal meaning while remaining robust to irrelevant changes. The second part studies controllability in visual in-context learning, demonstrating that latent representations in large autoregressive vision models can be made more interpretable through structured intermediate representations that reflect progressive visual reasoning. The third part extends these principles to multimodal generation by learning selectively aligned latent spaces that capture shared semantic factors while excluding modality-specific variability. Finally, the thesis generalises latent modeling beyond internal representations to interpretable procedural structures, showing that agentic workflows can be treated as latent constructs whose organisation is learned under explicit constraints.	en
dc.language.iso	en	en
dc.subject	Deep Learning	en
dc.subject	Trustworthy Machine Learning	en
dc.title	Towards Controllable and Interpretable Latent Modeling for Vision and Beyond	en
dc.type	Thesis
dc.type.thesis	Doctor of Philosophy	en
dc.rights.other	The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.	en
usyd.faculty	SeS faculties schools::Faculty of Engineering::School of Computer Science	en
usyd.degree	Doctor of Philosophy Ph.D.	en
usyd.awardinginst	The University of Sydney	en
usyd.advisor	Liu, Tongliang
usyd.include.pub	Yes	en