Deep Learning for Vision and Language Applications: from Scene Graph to Captioning
Field | Value | Language |
dc.contributor.author | Zhang, Zhiwang | |
dc.date.accessioned | 2022-09-14T02:57:00Z | |
dc.date.available | 2022-09-14T02:57:00Z | |
dc.date.issued | 2022 | en_AU |
dc.identifier.uri | https://hdl.handle.net/2123/29546 | |
dc.description | Includes publication | |
dc.description.abstract | In this thesis, we propose novel deep learning algorithms for vision and language tasks, including 3D scene graph generation and dense video captioning. The dense video captioning task is to first detect multiple key events in an untrimmed video and then describe each key event in natural language. The 3D scene graph generation task is to segment the objects in an indoor 3D scene and then predict the predicate between every pair of objects. The main contributions of this thesis are as follows. First, we formulate the dense video captioning task as a new visual cue-aided sentence summarization problem and propose a new division-and-summarization (DaS) framework for this task. In the division stage, we generate multiple sentence descriptions covering diverse visual content for each event proposal. In the summarization stage, we propose a two-stage Long Short-Term Memory (LSTM) network equipped with a new hierarchical attention mechanism. Second, building on the DaS framework, we propose a Graph Convolutional Network (GCN)-enhanced summarization module that exploits word relationships to refine the hidden representations of the generated sentence. Specifically, we treat the semantic words as nodes in a graph and learn their interactions via tightly coupled GCN and LSTM networks (an illustrative sketch of this coupling follows this record). Finally, we propose a position-aware two-branch framework for 3D scene graph generation, in which an additional position-aware branch carries explicit and detailed position information for relationship modeling. We further propose a Switchable Multi-stage fusion Graph Transformer (SMGT) for progressive and effective two-branch fusion. In SMGT, multiple fusion modules, together with a fusion-module selection procedure, are applied inside each layer of the graph transformer (a sketch of such a selection step also follows this record). | en_AU |
dc.language.iso | en | en_AU |
dc.subject | Deep Learning | en_AU |
dc.subject | Computer Vision | en_AU |
dc.subject | Natural Language Processing | en_AU |
dc.subject | Vision and Language | en_AU |
dc.subject | Dense Video Captioning | en_AU |
dc.subject | 3D Scene Graph Generation | en_AU |
dc.title | Deep Learning for Vision and Language Applications: from Scene Graph to Captioning | en_AU |
dc.type | Thesis | |
dc.type.thesis | Doctor of Philosophy | en_AU |
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en_AU |
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Electrical and Information Engineering | en_AU |
usyd.degree | Doctor of Philosophy Ph.D. | en_AU |
usyd.awardinginst | The University of Sydney | en_AU |
usyd.advisor | Xu, Dong | |
usyd.advisor | Ouyang, Wanli | |
usyd.include.pub | Yes | en_AU |
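The abstract's GCN-enhanced summarization module couples a graph convolution over semantic-word nodes with an LSTM. The sketch below is one plausible, minimal reading of that coupling, not the thesis implementation: the class names (`WordGCNLayer`, `CoupledGCNLSTMSummarizer`), the single GCN layer, the dot-product attention, and the row-normalized adjacency `adj` are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordGCNLayer(nn.Module):
    # One graph-convolution step over semantic-word nodes; `adj` is a
    # row-normalized adjacency over word relationships (assumed form).
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes, adj):
        # nodes: (num_words, dim), adj: (num_words, num_words)
        return F.relu(self.proj(adj @ nodes))

class CoupledGCNLSTMSummarizer(nn.Module):
    # At every decoding step the GCN refines the word nodes, the LSTM
    # state attends over the refined nodes, and the attended context
    # drives the next LSTM update, so the two networks run in lockstep.
    def __init__(self, dim):
        super().__init__()
        self.gcn = WordGCNLayer(dim)
        self.cell = nn.LSTMCell(dim, dim)
        self.query = nn.Linear(dim, dim)

    def forward(self, word_feats, adj, steps):
        # word_feats: (num_words, dim) embeddings of the semantic words
        h = word_feats.new_zeros(1, word_feats.size(1))
        c = torch.zeros_like(h)
        states = []
        for _ in range(steps):
            nodes = self.gcn(word_feats, adj)                    # refine nodes
            scores = nodes @ self.query(h).squeeze(0)            # (num_words,)
            ctx = F.softmax(scores, dim=0).unsqueeze(0) @ nodes  # (1, dim)
            h, c = self.cell(ctx, (h, c))                        # feed back
            states.append(h)
        return torch.stack(states, dim=1)  # (1, steps, dim) decoder states

# Toy invocation with random features standing in for word embeddings.
words = torch.randn(6, 128)                 # six semantic-word nodes
adj = F.softmax(torch.randn(6, 6), dim=-1)  # toy normalized adjacency
hidden = CoupledGCNLSTMSummarizer(128)(words, adj, steps=10)
```

Attending over GCN-refined nodes, rather than static embeddings, is what lets the word-relationship structure influence each hidden-state update.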
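In the same spirit, here is a hedged sketch of the fusion-module selection that the abstract attributes to each SMGT layer: a learned softmax gate weighs three toy fusion candidates for the position-aware (`geo`) and semantic (`sem`) branch features. The candidate set, the gate, and all names here are assumptions, not the thesis design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableFusion(nn.Module):
    # Softly selects among several candidate fusion modules for the
    # two branches; an assumed stand-in for SMGT's selection procedure.
    def __init__(self, dim):
        super().__init__()
        self.concat_proj = nn.Linear(2 * dim, dim)
        self.gate_proj = nn.Linear(2 * dim, dim)
        self.selector = nn.Linear(2 * dim, 3)  # one score per candidate

    def forward(self, geo, sem):
        # geo, sem: (num_nodes, dim) position-aware / semantic features
        pair = torch.cat([geo, sem], dim=-1)
        candidates = torch.stack([
            geo + sem,                                        # additive
            self.concat_proj(pair),                           # concat-project
            geo + torch.sigmoid(self.gate_proj(pair)) * sem,  # gated
        ])                                                    # (3, n, dim)
        weights = F.softmax(self.selector(pair), dim=-1)      # (n, 3)
        # Per-node weighted sum over the candidate axis.
        return torch.einsum('nk,knd->nd', weights, candidates)

# 12 object nodes from each branch of a (hypothetical) transformer layer.
geo, sem = torch.randn(12, 256), torch.randn(12, 256)
fused = SwitchableFusion(256)(geo, sem)  # (12, 256) fused node features
```

Repeating such a block inside every graph-transformer layer gives the multi-stage flavor: each layer can re-weigh how the two branches combine instead of committing to one fusion operator up front.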