A Unified Multi-Modal Approach for 3D Referring Expression Segmentation
Access status:
Open Access
Type
Thesis
Thesis type
Masters by Research
Author/s
Zhou, Keshen
Abstract
Generalised 3D Referring Expression Segmentation (3D-GRES) segments the exact 3D objects described by free-form language, whether a description matches multiple targets, a single target, or no target at all. Most existing methods rely solely on sparse, colour-poor point clouds, neglecting the complementary semantic richness of multi-view RGB images. In this paper, we propose IS-RES, a unified multi-modal framework that integrates RGB images and point clouds for 3D-GRES. Specifically, IS-RES extracts instance masks with the Segment Anything Model (SAM), obtains both dense and instance-aware 2D embeddings through CLIP, and unprojects the 2D embeddings into 3D point clouds via confidence-weighted pixel-to-point association. A progressive multi-level fusion strategy then transforms the fragmented multi-modal features into hierarchical representations, enabling adaptive alignment between instance-level semantics and geometric structures. Extensive experiments demonstrate that IS-RES achieves state-of-the-art performance on both the ScanRefer and Multi3DRefer benchmarks, with significant improvements in challenging scenarios involving multiple instances and complex spatial relationships.
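The abstract does not spell out how the confidence-weighted pixel-to-point association is computed, so the following is only an illustrative sketch of the general idea, not the thesis implementation. It assumes a pinhole camera model, a per-view depth map, and a hypothetical confidence defined as a Gaussian of the gap between a point's depth and the depth map value at its projected pixel. The names `to_camera`, `project`, `unproject_features`, and the parameter `sigma` are invented for this example.

```python
import math


def to_camera(point, rotation, translation):
    """Map a world-frame point into a camera frame using rotation rows and a translation."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project(cam_point, intrinsics):
    """Pinhole projection of a camera-frame point (x, y, z) to pixel coordinates (u, v)."""
    fx, fy, cx, cy = intrinsics
    x, y, z = cam_point
    return fx * x / z + cx, fy * y / z + cy


def unproject_features(points, views, sigma=0.05):
    """Fuse per-pixel 2D features onto 3D points as a confidence-weighted average.

    points: list of (x, y, z) world coordinates.
    views:  list of dicts with keys 'rotation', 'translation', 'intrinsics',
            'depth' (2D list of depths) and 'feat' (2D list of feature vectors).
    Returns one fused feature vector per point, or None if no view sees the point.
    """
    fused = []
    for point in points:
        acc, total = None, 0.0
        for view in views:
            cam = to_camera(point, view["rotation"], view["translation"])
            if cam[2] <= 0:
                continue  # point lies behind this camera
            u, v = project(cam, view["intrinsics"])
            col, row = int(round(u)), int(round(v))
            depth = view["depth"]
            if not (0 <= row < len(depth) and 0 <= col < len(depth[0])):
                continue  # point projects outside the image
            # Assumed confidence: high when the point's depth agrees with the depth map.
            gap = cam[2] - depth[row][col]
            weight = math.exp(-(gap * gap) / (2.0 * sigma * sigma))
            feat = view["feat"][row][col]
            if acc is None:
                acc = [weight * f for f in feat]
            else:
                acc = [a + weight * f for a, f in zip(acc, feat)]
            total += weight
        if acc is None or total <= 1e-8:
            fused.append(None)
        else:
            fused.append([a / total for a in acc])
    return fused
```

In this sketch, a view whose depth map disagrees with the point's depth (for example because the point is occluded) contributes little to the fused feature, while views that see the point directly dominate the average.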
Date
2025
Licence
The author retains copyright of this thesis
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering, School of Computer Science
Awarding institution
The University of Sydney