Designing Deep Model and Training Paradigm for Object Perception
Access status: USyd Access
Type: Thesis
Thesis type: Doctor of Philosophy
Author/s: Zhou, Dongzhan
Abstract:
In this thesis, we focus on three object perception tasks: neural architecture search in object recognition, object detection, and audio-visual localization. We develop novel approaches for these tasks and conduct extensive experiments to validate their effectiveness. First, we observe that the proxies in Neural Architecture Search (NAS) differ in their ability to maintain rank consistency among candidates. We examine widely used reduction factors and investigate their influence on the object recognition task. Based on these observations, we identify reliable reduced settings that achieve a high acceleration ratio and strong rank consistency simultaneously. These settings can be applied to existing NAS methods to further reduce search costs while maintaining competitive accuracy. Second, we propose a novel pre-training paradigm for object detection, denoted Montage pre-training, which requires only the target detection dataset and removes the burden of external data. We carefully extract training samples and devise a novel input pattern that aggregates four samples in a montage style to improve pre-training efficiency. Considering the characteristics of object detectors, we propose an ERF-adaptive (effective receptive field) dense classification strategy to further benefit subsequent detector training. Our Montage pre-training consumes only 1/4 of the computational resources of standard pre-training yet achieves on-par or even better performance. Finally, we propose a visual reasoning module that explicitly exploits rich visual context semantics for the self-supervised audio-visual localization task. The learning objectives are carefully designed to guide the extracted visual semantics and enhance audio-visual interactions, leading to stronger feature representations. Experiments on three benchmark datasets show that our approach significantly boosts localization performance.
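The first contribution hinges on measuring how well a cheap proxy preserves the ranking that full training would produce over candidate architectures. A standard way to quantify this is Kendall's tau rank correlation between proxy scores and full-training accuracies. The sketch below illustrates that measurement only; the proxy settings, reduction factors, and scores are hypothetical, not the thesis's actual experimental configuration.

```python
# Minimal sketch: quantifying a NAS proxy's rank consistency with
# Kendall's tau. All numbers below are illustrative placeholders.
from scipy.stats import kendalltau

# Hypothetical validation accuracies of 8 candidates after full training.
full_training_acc = [72.1, 74.3, 69.8, 75.0, 71.5, 73.2, 70.4, 74.8]

# Accuracies of the same candidates under a reduced proxy setting
# (e.g. fewer epochs, lower resolution, or a data subset).
proxy_acc = [70.0, 72.8, 68.5, 73.9, 70.9, 71.0, 69.1, 73.5]

tau, p_value = kendalltau(full_training_acc, proxy_acc)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
# tau near 1.0 means the proxy preserves the full-training ranking, so
# candidates can be compared reliably at a fraction of the cost.
```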
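The second contribution aggregates four training samples into a single montage-style input. As a rough illustration of that input pattern, the sketch below stitches four image crops into one canvas using a plain 2x2 layout; the thesis's actual sample extraction and assembly rules are more elaborate, and the `montage_2x2` helper and tile size are assumptions for this example.

```python
# Minimal sketch of a montage-style input: four samples stitched into
# one training canvas. A simple 2x2 layout for illustration only; the
# thesis's actual extraction and assembly strategy differs.
import numpy as np

def montage_2x2(samples, tile=112):
    """Stitch four HxWx3 uint8 images into one (2*tile)x(2*tile) canvas."""
    assert len(samples) == 4, "montage expects exactly four samples"
    canvas = np.zeros((2 * tile, 2 * tile, 3), dtype=np.uint8)
    for idx, img in enumerate(samples):
        # Nearest-neighbour resize via index sampling, to keep the
        # sketch dependency-free; a real pipeline would use a proper
        # image-resize op with interpolation.
        ys = np.arange(tile) * img.shape[0] // tile
        xs = np.arange(tile) * img.shape[1] // tile
        patch = img[ys][:, xs]
        r, c = divmod(idx, 2)
        canvas[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile] = patch
    return canvas

# Usage with random stand-in crops of mixed sizes:
samples = [np.random.randint(0, 255, (h, w, 3), dtype=np.uint8)
           for h, w in [(200, 300), (150, 150), (240, 180), (100, 400)]]
print(montage_2x2(samples).shape)  # (224, 224, 3)
```

Packing four samples per forward pass is what lets this style of pre-training amortize computation, consistent with the abstract's claim of roughly a 1/4 resource budget.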
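For the third contribution, self-supervised audio-visual localization is commonly formulated as a similarity map between an audio embedding and every spatial position of a visual feature map. The sketch below shows only that standard cosine-similarity formulation; the thesis's visual reasoning module and its learning objectives are not reproduced here, and the function name and tensor shapes are assumptions.

```python
# Minimal sketch: audio-visual localization as a cosine-similarity map
# between one audio embedding and a visual feature map. Standard
# formulation only; not the thesis's visual reasoning module.
import torch
import torch.nn.functional as F

def localization_map(visual_feat, audio_emb):
    """visual_feat: (B, C, H, W); audio_emb: (B, C) -> heatmap (B, H, W)."""
    v = F.normalize(visual_feat, dim=1)  # unit-norm channel vector per pixel
    a = F.normalize(audio_emb, dim=1)    # unit-norm audio embedding
    # Cosine similarity at every spatial location: dot product over C.
    return torch.einsum("bchw,bc->bhw", v, a)  # values in [-1, 1]

# Usage with random stand-in features:
v = torch.randn(2, 512, 14, 14)  # e.g. a CNN visual feature map
a = torch.randn(2, 512)          # e.g. a pooled audio-network embedding
print(localization_map(v, a).shape)  # torch.Size([2, 14, 14])
```

High-similarity regions of the heatmap indicate where the sounding object is likely located; the thesis's reasoning module aims to sharpen such maps by exploiting visual context semantics.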
Date: 2023
Rights statement: The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School: Faculty of Engineering, School of Electrical and Information Engineering
Awarding institution: The University of Sydney