Show simple item record

Field: Value [Language]
dc.contributor.author: Zhang, Zhiwang
dc.date.accessioned: 2022-09-14T02:57:00Z
dc.date.available: 2022-09-14T02:57:00Z
dc.date.issued: 2022 [en_AU]
dc.identifier.uri: https://hdl.handle.net/2123/29546
dc.description: Includes publication
dc.description.abstract: In this thesis, we propose novel deep learning algorithms for vision-and-language tasks, including 3D scene graph generation and dense video captioning. Dense video captioning first detects multiple key events in an untrimmed video and then describes each key event in natural language. 3D scene graph generation segments the objects in an indoor 3D scene and then predicts the predicates between every pair of objects. The main contributions of this thesis are as follows. First, we formulate dense video captioning as a new visual cue-aided sentence summarization problem and propose a new division-and-summarization (DaS) framework for this task. In the division stage, we generate multiple sentence descriptions covering diverse visual content for each event proposal. In the summarization stage, we propose a two-stage Long Short-Term Memory (LSTM) network equipped with a new hierarchical attention mechanism. Second, building on the DaS framework for dense video captioning, we further propose a Graph Convolutional Network (GCN)-enhanced summarization module for sentence summarization, which exploits word relationships to refine the hidden representations of the generated sentences. Specifically, we treat the semantic words as nodes in a GCN graph and learn their interactions via tightly coupled GCN and LSTM networks. Finally, we propose a newly designed position-aware two-branch framework for 3D scene graph generation. In this two-branch framework, we additionally propose a position-aware branch, which provides explicit and detailed position information for relationship modeling. In addition, we propose a Switchable Multi-stage fusion Graph Transformer (SMGT) for progressive and effective two-branch fusion. In SMGT, multiple fusion modules, together with a fusion-module selection procedure, are applied inside each layer of our graph transformer. [en_AU]
dc.language.iso: en [en_AU]
dc.subject: Deep Learning [en_AU]
dc.subject: Computer Vision [en_AU]
dc.subject: Natural Language Processing [en_AU]
dc.subject: Vision and Language [en_AU]
dc.subject: Dense Video Captioning [en_AU]
dc.subject: 3D Scene Graph Generation [en_AU]
dc.title: Deep Learning for Vision and Language Applications: from Scene Graph to Captioning [en_AU]
dc.type: Thesis
dc.type.thesis: Doctor of Philosophy [en_AU]
dc.rights.other: The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. [en_AU]
usyd.faculty: SeS faculties schools::Faculty of Engineering::School of Electrical and Information Engineering [en_AU]
usyd.degree: Doctor of Philosophy Ph.D. [en_AU]
usyd.awardinginst: The University of Sydney [en_AU]
usyd.advisor: Xu, Dong
usyd.advisor: Ouyang, Wanli
usyd.include.pub: Yes [en_AU]

