3D Reconstruction and Understanding

Huang, Di

Access status:

USyd Access

Type

Thesis

Thesis type

Doctor of Philosophy

Author/s

Huang, Di

Abstract

This thesis explores 3D reconstruction and understanding, essential for intelligent systems to perceive and reason about the world. By emphasizing their complementarity, it overcomes precision, scalability, and generalization limits in real-world 3D pipelines, aiming to recover accurate geometry enriched with semantic meaning for robotics, autonomous navigation, and augmented reality. Reconstruction methods must handle occlusions, reflections, and sparse data, while understanding requires capturing both fine-grained details and high-level context across diverse objects and motions. We tackle these linked challenges by jointly optimizing geometric and semantic processes through shared representations, revealing how each can inform and strengthen the other. Four contributions structure this work: a geometry-driven method for high-fidelity monocular reconstruction of hand-held objects without learned priors; Ponder, a point-cloud pretraining paradigm that uses differentiable rendering of RGB-D data to enhance detection, segmentation, and reconstruction; MotionGPT, a multimodal model uniting language and geometry encoders to generate realistic human motion under varied control signals; and Agent3D-Zero, a zero-shot 3D understanding system that iteratively selects viewpoints and synthesizes knowledge from meshes via visual prompts in large language models, eliminating the need for extensive 3D training data. Extensive experiments demonstrate state-of-the-art performance in object reconstruction, semantic segmentation, motion synthesis, and scene understanding. By integrating geometric and semantic reasoning, pretraining strategies, and multimodal cues, this work establishes a unified framework for 3D scene interpretation that advances theoretical boundaries and delivers practical benefits ranging from digital content creation to human–robot interaction, paving the way for next-generation intelligent systems.

Date

2025

Rights statement

The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.

Faculty/School

Faculty of Engineering, School of Electrical and Information Engineering

Awarding institution

The University of Sydney

Subjects

3D reconstruction
3D scene understanding
monocular video reconstruction
point-cloud pretraining
human motion synthesis
zero-shot 3D interpretation