Advanced Learning-Based Approaches for 3D Realistic Perception

Zhao, Runkai

Access status

USyd Access

Type

Thesis

Thesis type

Doctor of Philosophy

Author/s

Zhao, Runkai

Abstract

3D realistic perception refers to the faithful ego-centric understanding of real-world object attributes and their precise spatial localization, while maintaining geometric and semantic consistency. In contrast to purely semantic 2D perception, it provides spatially grounded representations (such as bird’s-eye-view (BEV) maps, 3D curves, occupancy, and object trajectories) that are actionable for planning and control. Such perception can be derived from a variety of sensing modalities, including LiDAR, cameras, or multi-modal fusion approaches. Despite these advances, the field is impeded by inherent challenges such as occlusion, long-range sparsity, adverse weather conditions, and domain shifts. In safety-critical applications like autonomous driving and embodied intelligence, the success of 3D realistic perception ultimately depends on achieving high metric accuracy, robustness, and computational efficiency. This thesis advances learning-based 3D perception for lane-line and dynamic-object understanding in challenging driving environments. To address the scarcity and annotation burden of LiDAR data, it introduces LiSV-3DLane, the first large-scale surround-view 3D lane dataset with enriched semantic annotations. Based on this resource, LiLaDet projects LiDAR geometry into a BEV representation for precise 3D lane recovery and broader applicability of LiDAR-based perception. To reduce the high cost of point-cloud processing, LaneCMKT transfers 3D cues from a LiDAR teacher to a monocular image student via cross-modal distillation, improving detection robustness under adverse conditions. Finally, BeXT (Bringing eXpertises Together) integrates complementary Visual Foundation Models (VFMs) into a lightweight monocular encoder through expertise adapter pretraining and dynamic feature routing. Collectively, these contributions establish a pathway from LiDAR to scalable monocular deployment, enabling robust, efficient, and metrically faithful 3D realistic perception.

Date

2025

Rights statement

The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.

Faculty/School

Faculty of Engineering, School of Computer Science

Awarding institution

The University of Sydney

Subjects

Artificial Intelligence
Deep Learning
3D Computer Vision
Multi-channel Data Understanding
Autonomous Driving
3D Realistic Perception