Show simple item record

Field | Value | Language
dc.contributor.author | Zheng, Qi
dc.date.accessioned | 2023-08-29T05:59:16Z
dc.date.available | 2023-08-29T05:59:16Z
dc.date.issued | 2023 | en_AU
dc.identifier.uri | https://hdl.handle.net/2123/31617
dc.description.abstract | To build machine agents with intelligent capabilities that mimic human perception and cognition, vision and language stand out as two essential modalities, fostering the fields of computer vision and natural language processing. Advances in these fields stimulate research in vision-language multimodal learning, which admits both visual and linguistic inputs and outputs. Due to the innate differences between the two modalities and the lack of large-scale fine-grained annotations, multimodal agents tend to inherit unimodal shortcuts. In this thesis, we develop various solutions to intervene on unimodal shortcuts in multimodal generation and reasoning. For visual shortcuts, we introduce a linguistic prior and devise a syntax-aware action-targeting module for dynamic description to rectify the correlation between the subject and object of a sentence. We exploit the concept hierarchy and propose a visual superordinate abstraction framework for unbiased concept learning, reducing the correlation among different attributes of an object. For linguistic shortcuts, we disentangle topic and syntax to reduce repetition in the paragraph descriptions generated for a given image. With the ubiquity of large-scale pre-trained models, we leverage self-supervised learning during fine-tuning to increase the robustness of multimodal reasoning. The rapid development of multimodal learning promises embodied agents capable of interacting with physical environments. This thesis studies vision-and-language navigation, a typical embodied task, in discrete scenarios and proposes an episodic scene memory (ESceme) mechanism to balance generalization and efficiency. We identify a desirable instantiation of the mechanism, namely candidate enhancing, and validate its superiority in various settings. Without extra time or computational cost before inference, ESceme improves performance in unseen environments by a large margin. We hope our findings inspire more practical explorations of episodic memory in embodied AI. | en_AU
dc.language.iso | en | en_AU
dc.subject | Vision-language multimodal learning | en_AU
dc.subject | embodied AI | en_AU
dc.subject | vision-and-language navigation | en_AU
dc.subject | captioning | en_AU
dc.title | From Vision-Language Multimodal Learning Towards Embodied Agents | en_AU
dc.type | Thesis
dc.type.thesis | Doctor of Philosophy | en_AU
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en_AU
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Computer Science | en_AU
usyd.degree | Doctor of Philosophy Ph.D. | en_AU
usyd.awardinginst | The University of Sydney | en_AU
usyd.advisor | Tao, Dacheng

