Deep Learning for Vision and Language Applications: from Scene Graph to Captioning
Field | Value | Language |
dc.contributor.author | Zhang, Zhiwang | |
dc.date.accessioned | 2022-09-14T02:57:00Z | |
dc.date.available | 2022-09-14T02:57:00Z | |
dc.date.issued | 2022 | en_AU |
dc.identifier.uri | https://hdl.handle.net/2123/29546 | |
dc.description | Includes publication | |
dc.description.abstract | In this thesis, we propose novel deep learning algorithms for vision and language tasks, including 3D scene graph generation and dense video captioning. The dense video captioning task is to first detect multiple key events in an untrimmed video and then describe each key event in natural language. The 3D scene graph generation task is to segment the objects in an indoor 3D scene and then predict the predicate between every pair of objects. The main contributions of this thesis are as follows. First, we formulate the dense video captioning task as a new visual cue-aided sentence summarization problem and propose a new division-and-summarization (DaS) framework for this task. In the division stage, we generate multiple sentence descriptions covering diverse visual content for each event proposal. In the summarization stage, we propose a two-stage Long Short-Term Memory (LSTM) network equipped with a new hierarchical attention mechanism. Second, building on the DaS framework, we propose a Graph Convolutional Network (GCN)-enhanced summarization module that exploits word relationships to refine the hidden representations of the generated sentence. Specifically, we treat the semantic words as nodes in a graph and learn their interactions via tightly coupled GCN and LSTM networks (an illustrative sketch of this coupling follows this record). Finally, we propose a position-aware two-branch framework for 3D scene graph generation, in which an additional position-aware branch carries explicit and detailed position information for relationship modeling. We further propose a Switchable Multi-stage fusion Graph Transformer (SMGT) for progressive and effective two-branch fusion. In SMGT, multiple fusion modules, together with a fusion-module selection procedure, are applied inside each layer of the graph transformer (a sketch of such a selection step also follows this record). | en_AU |
dc.language.iso | en | en_AU |
dc.subject | Deep Learning | en_AU |
dc.subject | Computer Vision | en_AU |
dc.subject | Natural Language Processing | en_AU |
dc.subject | Vision and Language | en_AU |
dc.subject | Dense Video Captioning | en_AU |
dc.subject | 3D Scene Graph Generation | en_AU |
dc.title | Deep Learning for Vision and Language Applications: from Scene Graph to Captioning | en_AU |
dc.type | Thesis | |
dc.type.thesis | Doctor of Philosophy | en_AU |
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en_AU |
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Electrical and Information Engineering | en_AU |
usyd.degree | Doctor of Philosophy Ph.D. | en_AU |
usyd.awardinginst | The University of Sydney | en_AU |
usyd.advisor | Xu, Dong | |
usyd.advisor | Ouyang, Wanli | |
usyd.include.pub | Yes | en_AU |
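The abstract's GCN-enhanced summarization module couples a graph convolution over semantic-word nodes with an LSTM. The sketch below is one plausible, minimal reading of that coupling, not the thesis implementation: the class names (`WordGCNLayer`, `CoupledGCNLSTMSummarizer`), the single GCN layer, the dot-product attention, and the row-normalized adjacency `adj` are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordGCNLayer(nn.Module):
    # One graph-convolution step over semantic-word nodes; `adj` is a
    # row-normalized adjacency over word relationships (assumed form).
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes, adj):
        # nodes: (num_words, dim), adj: (num_words, num_words)
        return F.relu(self.proj(adj @ nodes))

class CoupledGCNLSTMSummarizer(nn.Module):
    # At every decoding step the GCN refines the word nodes, the LSTM
    # state attends over the refined nodes, and the attended context
    # drives the next LSTM update, so the two networks run in lockstep.
    def __init__(self, dim):
        super().__init__()
        self.gcn = WordGCNLayer(dim)
        self.cell = nn.LSTMCell(dim, dim)
        self.query = nn.Linear(dim, dim)

    def forward(self, word_feats, adj, steps):
        # word_feats: (num_words, dim) embeddings of the semantic words
        h = word_feats.new_zeros(1, word_feats.size(1))
        c = torch.zeros_like(h)
        states = []
        for _ in range(steps):
            nodes = self.gcn(word_feats, adj)                    # refine nodes
            scores = nodes @ self.query(h).squeeze(0)            # (num_words,)
            ctx = F.softmax(scores, dim=0).unsqueeze(0) @ nodes  # (1, dim)
            h, c = self.cell(ctx, (h, c))                        # feed back
            states.append(h)
        return torch.stack(states, dim=1)  # (1, steps, dim) decoder states

# Toy invocation with random features standing in for word embeddings.
words = torch.randn(6, 128)                 # six semantic-word nodes
adj = F.softmax(torch.randn(6, 6), dim=-1)  # toy normalized adjacency
hidden = CoupledGCNLSTMSummarizer(128)(words, adj, steps=10)
```

Attending over GCN-refined nodes, rather than static embeddings, is what lets the word-relationship structure influence each hidden-state update.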
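In the same spirit, here is a hedged sketch of the fusion-module selection that the abstract attributes to each SMGT layer: a learned softmax gate weighs three toy fusion candidates for the position-aware (`geo`) and semantic (`sem`) branch features. The candidate set, the gate, and all names here are assumptions, not the thesis design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableFusion(nn.Module):
    # Softly selects among several candidate fusion modules for the
    # two branches; an assumed stand-in for SMGT's selection procedure.
    def __init__(self, dim):
        super().__init__()
        self.concat_proj = nn.Linear(2 * dim, dim)
        self.gate_proj = nn.Linear(2 * dim, dim)
        self.selector = nn.Linear(2 * dim, 3)  # one score per candidate

    def forward(self, geo, sem):
        # geo, sem: (num_nodes, dim) position-aware / semantic features
        pair = torch.cat([geo, sem], dim=-1)
        candidates = torch.stack([
            geo + sem,                                        # additive
            self.concat_proj(pair),                           # concat-project
            geo + torch.sigmoid(self.gate_proj(pair)) * sem,  # gated
        ])                                                    # (3, n, dim)
        weights = F.softmax(self.selector(pair), dim=-1)      # (n, 3)
        # Per-node weighted sum over the candidate axis.
        return torch.einsum('nk,knd->nd', weights, candidates)

# 12 object nodes from each branch of a (hypothetical) transformer layer.
geo, sem = torch.randn(12, 256), torch.randn(12, 256)
fused = SwitchableFusion(256)(geo, sem)  # (12, 256) fused node features
```

Repeating such a block inside every graph-transformer layer gives the multi-stage flavor: each layer can re-weigh how the two branches combine instead of committing to one fusion operator up front.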