Deep Learning for Vision and Language Applications: from Scene Graph to Captioning
Access status: USyd Access
Type: Thesis
Thesis type: Doctor of Philosophy
Author/s: Zhang, Zhiwang
Abstract:
In this thesis, we propose novel deep learning algorithms for vision and language tasks, including 3D scene graph generation and dense video captioning. The dense video captioning task is to first detect multiple key events in an untrimmed video and then describe each key event using natural language. The 3D scene graph generation task is to segment the objects in an indoor 3D scene and then predict the predicates between every pair of objects. The main contributions of this thesis are listed below.

Firstly, we formulate the dense video captioning task as a new visual cue-aided sentence summarization problem and propose a new division-and-summarization (DaS) framework for this task. In the division stage, we generate multiple sentence descriptions covering diverse visual content for each event proposal. In the summarization stage, we propose a two-stage Long Short-Term Memory (LSTM) network equipped with a new hierarchical attention mechanism.

Secondly, building on the DaS framework for dense video captioning, we further propose a Graph Convolutional Network (GCN)-enhanced summarization module for sentence summarization, which exploits word relationships to refine the hidden representations of the generated sentence. Specifically, we treat the semantic words as the nodes of a GCN graph and learn their interactions via tightly coupled GCN and LSTM networks.

Finally, we propose a newly designed position-aware two-branch framework for 3D scene graph generation. In this two-branch framework, we additionally introduce a position-aware branch, which captures explicit and detailed position information for relationship modeling. In addition, we propose a Switchable Multi-stage fusion Graph Transformer (SMGT) for progressive and effective two-branch fusion. In SMGT, multiple fusion modules with a fusion-module selection procedure are applied inside each layer of our graph transformer.
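The GCN-enhanced summarization module refines word representations by propagating information over a word-relationship graph. As a rough illustrative sketch only (not the thesis implementation: the single-layer form, the symmetric normalization, the toy adjacency matrix, and all dimensions here are assumptions), one graph-convolution step over semantic-word nodes could look like this:

```python
import numpy as np

def gcn_refine(H, A, W):
    """One GCN layer: refine node (word) features H using adjacency A and weights W.

    Uses the common propagation rule H' = ReLU(A_hat @ H @ W), where
    A_hat is the symmetrically normalized adjacency with self-loops.
    """
    A = A + np.eye(A.shape[0])            # add self-loops so each word keeps its own signal
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    A_hat = (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]  # D^-1/2 (A+I) D^-1/2
    return np.maximum(A_hat @ H @ W, 0.0)  # ReLU non-linearity

# Toy example: 4 semantic words with 8-dim hidden states (e.g. from an LSTM),
# linked by a hypothetical chain of word relationships.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))                # word hidden representations
A = np.array([[0, 1, 0, 0],               # word 0 related to word 1, etc.
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = rng.normal(size=(8, 8))                # learnable layer weights
H_refined = gcn_refine(H, A, W)            # refined word representations, shape (4, 8)
```

In the thesis the GCN is tightly coupled with an LSTM, so a refinement step like this would interleave with the recurrent decoding rather than run once in isolation.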
Date: 2022
Rights statement: The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School: Faculty of Engineering, School of Electrical and Information Engineering
Awarding institution: The University of Sydney