Designing Deep Model and Training Paradigm for Object Perception
| Field | Value | Language |
| --- | --- | --- |
dc.contributor.author | Zhou, Dongzhan | |
dc.date.accessioned | 2023-03-30T04:37:09Z | |
dc.date.available | 2023-03-30T04:37:09Z | |
dc.date.issued | 2023 | en_AU |
dc.identifier.uri | https://hdl.handle.net/2123/31055 | |
dc.description | Includes publication | |
dc.description.abstract | In this thesis, we focus on three object perception tasks: neural architecture search for object recognition, object detection, and audio-visual localization. We develop novel approaches for these tasks and conduct extensive experiments to validate their effectiveness. First, we observe that proxies in Neural Architecture Search (NAS) vary in their ability to maintain rank consistency among candidates. We examine widely used reduction factors and investigate their influence on the object recognition task. Based on our observations, we identify reliable reduced settings that achieve a high acceleration ratio and high rank consistency simultaneously. These settings work well with existing NAS methods, further reducing search costs while achieving competitive accuracy. Second, we propose a novel pre-training paradigm for object detection, denoted Montage pre-training, which requires only the target detection dataset and removes the burden of external data. We carefully extract training samples and devise a novel input pattern that aggregates four samples in a montage style to improve pre-training efficiency. Considering the characteristics of object detectors, we propose an ERF-adaptive dense classification strategy to further benefit subsequent detector training. Our Montage pre-training consumes only 1/4 of the computation resources of standard pre-training counterparts while achieving on-par or even better performance. Finally, we propose a visual reasoning module that explicitly exploits rich visual context semantics for the self-supervised audio-visual localization task. The learning objectives are carefully designed to guide the extracted visual semantics and enhance audio-visual interactions, leading to stronger feature representations. Experiments on three benchmark datasets show that our approach significantly boosts localization performance. | en_AU |
dc.language.iso | en | en_AU |
dc.subject | computer vision | en_AU |
dc.subject | neural network | en_AU |
dc.subject | object recognition | en_AU |
dc.subject | object detection | en_AU |
dc.subject | multi-modal learning | en_AU |
dc.title | Designing Deep Model and Training Paradigm for Object Perception | en_AU |
dc.type | Thesis | |
dc.type.thesis | Doctor of Philosophy | en_AU |
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en_AU |
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Electrical and Information Engineering | en_AU |
usyd.degree | Doctor of Philosophy (Ph.D.) | en_AU |
usyd.awardinginst | The University of Sydney | en_AU |
usyd.advisor | Ouyang, Wanli | |
usyd.include.pub | Yes | en_AU |
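The abstract describes Montage pre-training as aggregating four training samples into a single montage-style input. As a rough illustration only (the thesis's actual sample extraction and input pattern are more involved; the `montage_2x2` helper and the equal-size assumption below are ours, not from the thesis), a 2×2 aggregation of four image crops could be sketched as:

```python
import numpy as np

def montage_2x2(samples):
    """Aggregate four equally sized image arrays of shape (H, W, C)
    into one 2x2 montage of shape (2H, 2W, C)."""
    assert len(samples) == 4, "montage expects exactly four samples"
    top = np.concatenate(samples[:2], axis=1)     # left | right
    bottom = np.concatenate(samples[2:], axis=1)  # left | right
    return np.concatenate([top, bottom], axis=0)  # top over bottom

# Example: four 32x32 RGB crops -> one 64x64 montage input
crops = [np.random.rand(32, 32, 3) for _ in range(4)]
batch = montage_2x2(crops)
print(batch.shape)  # (64, 64, 3)
```

One network forward pass then sees four samples at once, which is one way such a scheme could reduce pre-training computation relative to processing the samples individually.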