Designing Deep Model and Training Paradigm for Object Perception
| Field | Value | Language |
| --- | --- | --- |
dc.contributor.author | Zhou, Dongzhan | |
dc.date.accessioned | 2023-03-30T04:37:09Z | |
dc.date.available | 2023-03-30T04:37:09Z | |
dc.date.issued | 2023 | en_AU |
dc.identifier.uri | https://hdl.handle.net/2123/31055 | |
dc.description | Includes publication | |
dc.description.abstract | In this thesis, we focus on three object perception tasks: neural architecture search for object recognition, object detection, and audio-visual localization. We develop novel approaches for these tasks and conduct extensive experiments to validate their effectiveness. First, we observe that proxies in Neural Architecture Search (NAS) vary in their ability to maintain rank consistency among candidates. We examine widely used reduction factors and investigate their influence on the object recognition task. Based on our observations, we identify reliable reduced settings that achieve a high acceleration ratio and high rank consistency simultaneously. These settings work well with existing NAS methods, further reducing search costs while achieving competitive accuracy. Second, we propose a novel pre-training paradigm for object detection, denoted Montage pre-training, which requires only the target detection dataset and removes the burden of external data. We carefully extract training samples and devise a novel input pattern that aggregates four samples in a montage style to improve pre-training efficiency. Considering the characteristics of object detectors, we propose an ERF-adaptive dense classification strategy to further benefit subsequent detector training. Our Montage pre-training consumes only 1/4 of the computation resources of standard pre-training counterparts while achieving on-par or even better performance. Finally, we propose a visual reasoning module that explicitly exploits rich visual context semantics for the self-supervised audio-visual localization task. The learning objectives are carefully designed to guide the extracted visual semantics and enhance audio-visual interactions, leading to stronger feature representations. Experiments on three benchmark datasets show that our approach significantly boosts localization performance. | en_AU |
dc.language.iso | en | en_AU |
dc.subject | computer vision | en_AU |
dc.subject | neural network | en_AU |
dc.subject | object recognition | en_AU |
dc.subject | object detection | en_AU |
dc.subject | multi-modal learning | en_AU |
dc.title | Designing Deep Model and Training Paradigm for Object Perception | en_AU |
dc.type | Thesis | |
dc.type.thesis | Doctor of Philosophy | en_AU |
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en_AU |
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Electrical and Information Engineering | en_AU |
usyd.degree | Doctor of Philosophy (Ph.D.) | en_AU |
usyd.awardinginst | The University of Sydney | en_AU |
usyd.advisor | Ouyang, Wanli | |
usyd.include.pub | Yes | en_AU |
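The abstract describes Montage pre-training as aggregating four training samples into a single montage-style input. As a rough illustration only (the thesis's actual sample extraction and input pattern are more involved; the `montage_2x2` helper and the equal-size assumption below are ours, not from the thesis), a 2×2 aggregation of four image crops could be sketched as:

```python
import numpy as np

def montage_2x2(samples):
    """Aggregate four equally sized image arrays of shape (H, W, C)
    into one 2x2 montage of shape (2H, 2W, C)."""
    assert len(samples) == 4, "montage expects exactly four samples"
    top = np.concatenate(samples[:2], axis=1)     # left | right
    bottom = np.concatenate(samples[2:], axis=1)  # left | right
    return np.concatenate([top, bottom], axis=0)  # top over bottom

# Example: four 32x32 RGB crops -> one 64x64 montage input
crops = [np.random.rand(32, 32, 3) for _ in range(4)]
batch = montage_2x2(crops)
print(batch.shape)  # (64, 64, 3)
```

One network forward pass then sees four samples at once, which is one way such a scheme could reduce pre-training computation relative to processing the samples individually.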