Show simple item record

Field | Value | Language
dc.contributor.author | Zheng, Qi
dc.date.accessioned | 2023-08-29T05:59:16Z
dc.date.available | 2023-08-29T05:59:16Z
dc.date.issued | 2023 | en_AU
dc.identifier.uri | https://hdl.handle.net/2123/31617
dc.description.abstract | To build machine agents with intelligent capabilities that mimic human perception and cognition, vision and language stand out as two essential modalities, fostering the fields of computer vision and natural language processing. Advances in these fields stimulate research in vision-language multimodal learning, which admits both visual and linguistic inputs and outputs. Due to the innate differences between the two modalities and the lack of large-scale fine-grained annotations, multimodal agents tend to inherit unimodal shortcuts. In this thesis, we develop various solutions to intervene on unimodal shortcuts in multimodal generation and reasoning. For visual shortcuts, we introduce a linguistic prior and devise a syntax-aware action-targeting module for dynamic description to rectify the correlation between the subject and object of a sentence. We exploit the concept hierarchy and propose a visual superordinate abstraction framework for unbiased concept learning, reducing the correlation among different attributes of an object. For linguistic shortcuts, we disentangle topic and syntax to reduce repetition in the paragraph descriptions generated for a given image. With the ubiquity of large-scale pre-trained models, we leverage self-supervised learning during fine-tuning to increase the robustness of multimodal reasoning. The rapid development of multimodal learning promises embodied agents capable of interacting with physical environments. This thesis studies vision-and-language navigation, a typical embodied task, in discrete scenarios and proposes an episodic scene memory (ESceme) mechanism to balance generalization and efficiency. We identify a desirable instantiation of the mechanism, namely candidate enhancing, and validate its superiority in various settings. Without extra time or computational cost before inference, ESceme improves performance in unseen environments by a large margin. We hope our findings inspire more practical explorations of episodic memory in embodied AI. | en_AU
dc.language.iso | en | en_AU
dc.subject | Vision-language multimodal learning | en_AU
dc.subject | embodied AI | en_AU
dc.subject | vision-and-language navigation | en_AU
dc.subject | captioning | en_AU
dc.title | From Vision-Language Multimodal Learning Towards Embodied Agents | en_AU
dc.type | Thesis
dc.type.thesis | Doctor of Philosophy | en_AU
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en_AU
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Computer Science | en_AU
usyd.degree | Doctor of Philosophy Ph.D. | en_AU
usyd.awardinginst | The University of Sydney | en_AU
usyd.advisor | Tao, Dacheng

