Towards Controllable and Interpretable Latent Modeling for Vision and Beyond

Zheng, Jiyang

Access status

Open Access

Type

Thesis

Thesis type

Doctor of Philosophy

Author/s

Zheng, Jiyang

Abstract

Deep learning models for vision and multimodal data rely on high-dimensional latent representations to achieve strong empirical performance. However, such representations are often difficult to interpret and control, limiting their reliability and adaptability in tasks requiring structured reasoning. This thesis studies how latent representations can be structured to enhance interpretability and controllability in visual and multimodal learning. It argues that effective latent modeling requires separating task-relevant semantics from irrelevant variability, and enforcing selective invariance and alignment during training. Rather than relying on entangled or opaque latent spaces, the thesis advocates for structured representations whose components can be explicitly preserved, modified, or aligned according to task demands. The first part of the thesis examines this principle in ordinal visual learning, where discriminative information is subtle and order-dependent. It shows that commonly used representation learning strategies induce excessive invariance, obscuring ordinal semantics in the latent space. By encouraging minimal and targeted latent variation, the proposed approach retains ordinal meaning while remaining robust to irrelevant changes. The second part studies controllability in visual in-context learning, demonstrating that latent representations in large autoregressive vision models can be made more interpretable through structured intermediate representations that reflect progressive visual reasoning. The third part extends these principles to multimodal generation by learning selectively aligned latent spaces that capture shared semantic factors while excluding modality-specific variability. Finally, the thesis generalises latent modeling beyond internal representations to interpretable procedural structures, showing that agentic workflows can be treated as latent constructs whose organisation is learned under explicit constraints.

Date

2026

Rights statement

The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.

Faculty/School

Faculty of Engineering, School of Computer Science

Awarding institution

The University of Sydney

Subjects

Deep Learning
Trustworthy Machine Learning