Show simple item record

Field | Value | Language
dc.contributor.author | Zheng, Jiyang |
dc.date.accessioned | 2026-03-05T23:12:02Z |
dc.date.available | 2026-03-05T23:12:02Z |
dc.date.issued | 2026 | en
dc.identifier.uri | https://hdl.handle.net/2123/34959 |
dc.description | Includes publication |
dc.description.abstract | Deep learning models for vision and multimodal data rely on high-dimensional latent representations to achieve strong empirical performance. However, such representations are often difficult to interpret and control, limiting their reliability and adaptability in tasks requiring structured reasoning. This thesis studies how latent representations can be structured to enhance interpretability and controllability in visual and multimodal learning. It argues that effective latent modeling requires separating task-relevant semantics from irrelevant variability, and enforcing selective invariance and alignment during training. Rather than relying on entangled or opaque latent spaces, the thesis advocates for structured representations whose components can be explicitly preserved, modified, or aligned according to task demands. The first part of the thesis examines this principle in ordinal visual learning, where discriminative information is subtle and order-dependent. It shows that commonly used representation learning strategies induce excessive invariance, obscuring ordinal semantics in the latent space. By encouraging minimal and targeted latent variation, the proposed approach retains ordinal meaning while remaining robust to irrelevant changes. The second part studies controllability in visual in-context learning, demonstrating that latent representations in large autoregressive vision models can be made more interpretable through structured intermediate representations that reflect progressive visual reasoning. The third part extends these principles to multimodal generation by learning selectively aligned latent spaces that capture shared semantic factors while excluding modality-specific variability. Finally, the thesis generalises latent modeling beyond internal representations to interpretable procedural structures, showing that agentic workflows can be treated as latent constructs whose organisation is learned under explicit constraints. | en
dc.language.iso | en | en
dc.subject | Deep Learning | en
dc.subject | Trustworthy Machine Learning | en
dc.title | Towards Controllable and Interpretable Latent Modeling for Vision and Beyond | en
dc.type | Thesis |
dc.type.thesis | Doctor of Philosophy | en
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Computer Science | en
usyd.degree | Doctor of Philosophy Ph.D. | en
usyd.awardinginst | The University of Sydney | en
usyd.advisor | Liu, Tongliang |
usyd.include.pub | Yes | en


