Advanced Learning-Based Approaches for 3D Realistic Perception
Access status:
USyd Access
Type
Thesis
Thesis type
Doctor of Philosophy
Author/s
Zhao, Runkai
Abstract
3D realistic perception refers to the faithful ego-centric understanding of real-world object attributes and their precise spatial localization, while maintaining geometric and semantic consistency. In contrast to purely semantic 2D perception, it provides spatially grounded representations (such as bird's-eye-view (BEV) maps, 3D curves, occupancy, and object trajectories) that are actionable for planning and control. Such perception can be derived from a variety of sensing modalities, including LiDAR, cameras, or multi-modal fusion approaches. Despite advances in these modalities, the field is impeded by inherent challenges such as occlusion, long-range sparsity, adverse weather conditions, and domain shifts. In safety-critical applications like autonomous driving and embodied intelligence, the success of 3D realistic perception ultimately depends on achieving high metric accuracy, robustness, and computational efficiency. This thesis advances learning-based 3D perception for lane-line and dynamic-object understanding in challenging driving environments. To address the scarcity and annotation burden of LiDAR data, it introduces LiSV-3DLane, the first large-scale surround-view 3D lane dataset with enriched semantic annotations. Based on this resource, LiLaDet projects LiDAR geometry into a BEV representation for precise 3D lane recovery and broader applicability of LiDAR-based perception. To reduce the high cost of point-cloud processing, LaneCMKT transfers 3D cues from a LiDAR teacher to a monocular image student via cross-modal distillation, improving detection robustness under adverse conditions. Finally, BeXT (Bringing eXpertises Together) integrates complementary Visual Foundation Models (VFMs) into a lightweight monocular encoder through expertise adapter pretraining and dynamic feature routing. Collectively, these contributions establish a pathway from LiDAR to scalable monocular deployment, enabling robust, efficient, and metrically faithful 3D realistic perception.
Date
2025
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering, School of Computer Science
Awarding institution
The University of Sydney