3D Reconstruction and Understanding
Access status:
USyd Access
Type
Thesis
Thesis type
Doctor of Philosophy
Author/s
Huang, Di
Abstract
This thesis explores 3D reconstruction and understanding, essential for intelligent systems to perceive and reason about the world. By emphasizing their complementarity, it overcomes precision, scalability, and generalization limits in real-world 3D pipelines, aiming to recover accurate geometry enriched with semantic meaning for robotics, autonomous navigation, and augmented reality. Reconstruction methods must handle occlusions, reflections, and sparse data, while understanding requires capturing both fine-grained details and high-level context across diverse objects and motions. We tackle these linked challenges by jointly optimizing geometric and semantic processes through shared representations, revealing how each can inform and strengthen the other. Four contributions structure this work: a geometry-driven method for high-fidelity monocular reconstruction of hand-held objects without learned priors; Ponder, a point-cloud pretraining paradigm that uses differentiable rendering of RGB-D data to enhance detection, segmentation, and reconstruction; MotionGPT, a multimodal model uniting language and geometry encoders to generate realistic human motion under varied control signals; and Agent3D-Zero, a zero-shot 3D understanding system that iteratively selects viewpoints and synthesizes knowledge from meshes via visual prompts in large language models, eliminating the need for extensive 3D training data. Extensive experiments demonstrate state-of-the-art performance in object reconstruction, semantic segmentation, motion synthesis, and scene understanding. By integrating geometric and semantic reasoning, pretraining strategies, and multimodal cues, this work establishes a unified framework for 3D scene interpretation that advances theoretical boundaries and delivers practical benefits, from digital content creation to human–robot interaction, paving the way for next-generation intelligent systems.
Date
2025
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering, School of Electrical and Information Engineering
Awarding institution
The University of Sydney