Designing Deep Model and Training Paradigm for Object Perception
Access status: USyd Access
Type: Thesis
Thesis type: Doctor of Philosophy
Author/s: Zhou, Dongzhan
Abstract:
In this thesis, we focus on three object perception tasks: neural architecture search in object recognition, object detection, and audio-visual localization. We develop novel approaches for these tasks and conduct extensive experiments to validate their effectiveness. First, we observe that the proxies in Neural Architecture Search (NAS) differ in their ability to maintain rank consistency among candidates. We examine widely used reduction factors and investigate their influence on the object recognition task. Based on these observations, we identify reliable reduced settings that achieve a high acceleration ratio and strong rank consistency simultaneously. These settings can be applied to existing NAS methods to further reduce search costs while maintaining competitive accuracy. Second, we propose a novel pre-training paradigm for object detection, denoted Montage pre-training, which requires only the target detection dataset and removes the burden of external data. We carefully extract training samples and devise a novel input pattern that aggregates four samples in a montage style to improve pre-training efficiency. Considering the characteristics of object detectors, we propose an ERF-adaptive (effective receptive field) dense classification strategy to further benefit subsequent detector training. Our Montage pre-training consumes only 1/4 of the computational resources of standard pre-training yet achieves on-par or even better performance. Finally, we propose a visual reasoning module that explicitly exploits rich visual context semantics for the self-supervised audio-visual localization task. The learning objectives are carefully designed to guide the extracted visual semantics and enhance audio-visual interactions, leading to stronger feature representations. Experiments on three benchmark datasets show that our approach significantly boosts localization performance.
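The first contribution hinges on measuring how well a cheap proxy preserves the ranking that full training would produce over candidate architectures. A standard way to quantify this is Kendall's tau rank correlation between proxy scores and full-training accuracies. The sketch below illustrates that measurement only; the proxy settings, reduction factors, and scores are hypothetical, not the thesis's actual experimental configuration.

```python
# Minimal sketch: quantifying a NAS proxy's rank consistency with
# Kendall's tau. All numbers below are illustrative placeholders.
from scipy.stats import kendalltau

# Hypothetical validation accuracies of 8 candidates after full training.
full_training_acc = [72.1, 74.3, 69.8, 75.0, 71.5, 73.2, 70.4, 74.8]

# Accuracies of the same candidates under a reduced proxy setting
# (e.g. fewer epochs, lower resolution, or a data subset).
proxy_acc = [70.0, 72.8, 68.5, 73.9, 70.9, 71.0, 69.1, 73.5]

tau, p_value = kendalltau(full_training_acc, proxy_acc)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
# tau near 1.0 means the proxy preserves the full-training ranking, so
# candidates can be compared reliably at a fraction of the cost.
```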
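The second contribution aggregates four training samples into a single montage-style input. As a rough illustration of that input pattern, the sketch below stitches four image crops into one canvas using a plain 2x2 layout; the thesis's actual sample extraction and assembly rules are more elaborate, and the `montage_2x2` helper and tile size are assumptions for this example.

```python
# Minimal sketch of a montage-style input: four samples stitched into
# one training canvas. A simple 2x2 layout for illustration only; the
# thesis's actual extraction and assembly strategy differs.
import numpy as np

def montage_2x2(samples, tile=112):
    """Stitch four HxWx3 uint8 images into one (2*tile)x(2*tile) canvas."""
    assert len(samples) == 4, "montage expects exactly four samples"
    canvas = np.zeros((2 * tile, 2 * tile, 3), dtype=np.uint8)
    for idx, img in enumerate(samples):
        # Nearest-neighbour resize via index sampling, to keep the
        # sketch dependency-free; a real pipeline would use a proper
        # image-resize op with interpolation.
        ys = np.arange(tile) * img.shape[0] // tile
        xs = np.arange(tile) * img.shape[1] // tile
        patch = img[ys][:, xs]
        r, c = divmod(idx, 2)
        canvas[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile] = patch
    return canvas

# Usage with random stand-in crops of mixed sizes:
samples = [np.random.randint(0, 255, (h, w, 3), dtype=np.uint8)
           for h, w in [(200, 300), (150, 150), (240, 180), (100, 400)]]
print(montage_2x2(samples).shape)  # (224, 224, 3)
```

Packing four samples per forward pass is what lets this style of pre-training amortize computation, consistent with the abstract's claim of roughly a 1/4 resource budget.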
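For the third contribution, self-supervised audio-visual localization is commonly formulated as a similarity map between an audio embedding and every spatial position of a visual feature map. The sketch below shows only that standard cosine-similarity formulation; the thesis's visual reasoning module and its learning objectives are not reproduced here, and the function name and tensor shapes are assumptions.

```python
# Minimal sketch: audio-visual localization as a cosine-similarity map
# between one audio embedding and a visual feature map. Standard
# formulation only; not the thesis's visual reasoning module.
import torch
import torch.nn.functional as F

def localization_map(visual_feat, audio_emb):
    """visual_feat: (B, C, H, W); audio_emb: (B, C) -> heatmap (B, H, W)."""
    v = F.normalize(visual_feat, dim=1)  # unit-norm channel vector per pixel
    a = F.normalize(audio_emb, dim=1)    # unit-norm audio embedding
    # Cosine similarity at every spatial location: dot product over C.
    return torch.einsum("bchw,bc->bhw", v, a)  # values in [-1, 1]

# Usage with random stand-in features:
v = torch.randn(2, 512, 14, 14)  # e.g. a CNN visual feature map
a = torch.randn(2, 512)          # e.g. a pooled audio-network embedding
print(localization_map(v, a).shape)  # torch.Size([2, 14, 14])
```

High-similarity regions of the heatmap indicate where the sounding object is likely located; the thesis's reasoning module aims to sharpen such maps by exploiting visual context semantics.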
Date: 2023
Rights statement: The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School: Faculty of Engineering, School of Electrical and Information Engineering
Awarding institution: The University of Sydney