Learning Ability of Deep ReLU Networks: Pairwise Tasks and Gradient Descent Methods

Zhou, Junyu

Permalink

Access status:

Open Access

Type

Thesis

Thesis type

Doctor of Philosophy

Author/s

Zhou, Junyu

Abstract

Deep neural networks (DNNs) have become central to modern machine learning due to their strong empirical performance. However, their theoretical understanding—especially regarding generalization—remains limited. This thesis advances the theory of deep ReLU networks through ...
See moreDeep neural networks (DNNs) have become central to modern machine learning due to their strong empirical performance. However, their theoretical understanding—especially regarding generalization—remains limited. This thesis advances the theory of deep ReLU networks through two lenses: pairwise learning tasks and gradient descent methods. For pairwise learning, we study generalization in non-parametric estimation without relying on restrictive convexity or VC-class assumptions. We establish sharp oracle inequalities for empirical minimizers under general hypothesis spaces and Lipschitz pairwise losses. Applied to pairwise least squares regression, our bounds match known minimax rates up to logarithmic terms. A key innovation is constructing a structured deep ReLU network approximating the true predictor, forming a target hypothesis space with controlled complexity. This framework successfully handles problems beyond the reach of existing theories. For metric and similarity learning, we exploit the structure of the true metric. By deriving its form under hinge loss, we approximate it using structured deep ReLU networks and analyze the excess generalization error by bounding the approximation and the estimation errors. An optimal excess risk rate is achieved, marking the first known such analysis for metric/similarity learning. We also explore extensions to general losses. For gradient descent methods, we study GD and SGD for overparameterized deep ReLU networks in the NTK regime. Prior work mainly covers shallow networks; we fill this gap by establishing the first minimax-optimal generalization rates for GD/SGD with deep architectures. Under polynomial width scaling, our results show these methods can match the generalization performance of kernel approaches.
See less

Date

2025

Licence

The author retains copyright of this thesis

Rights statement

The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.

Faculty/School

Faculty of Science, School of Mathematics and Statistics

Department, Discipline or Centre

Mathematics and Statistics

Awarding institution

The University of Sydney

Subjects

Deep Learning Theory
Deep Neural Networks
Generalization Analysis
Metric and Similarity Learning
Stochastic Gradient Descent