A Unified Multi-Modal Approach for 3D Referring Expression Segmentation

Zhou, Keshen

Access status

Open Access

Type

Thesis

Thesis type

Masters by Research

Author/s

Zhou, Keshen

Abstract

Generalised 3D Referring Expression Segmentation (3D-GRES) segments the exact 3D objects described by free-form language, whether a description matches multiple targets, a single target, or no target at all. Most existing methods rely solely on sparse, colour-poor point clouds, neglecting the complementary semantic richness of multi-view RGB images. In this paper, we propose IS-RES, a unified multi-modal framework that integrates RGB images and point clouds for 3D-GRES. Specifically, IS-RES extracts instance masks with the Segment Anything Model (SAM), obtains both dense and instance-aware 2D embeddings through CLIP, and unprojects the 2D embeddings onto the 3D point cloud via confidence-weighted pixel-to-point association. A progressive multi-level fusion strategy then transforms the fragmented multi-modal features into hierarchical representations, enabling adaptive alignment between instance-level semantics and geometric structures. Extensive experiments demonstrate that IS-RES achieves state-of-the-art performance on both the ScanRefer and Multi3DRefer benchmarks, with significant improvements in challenging scenarios involving multiple instances and complex spatial relationships.

Date

2025

Licence

The author retains copyright of this thesis

Rights statement

The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.

Faculty/School

Faculty of Engineering, School of Computer Science

Awarding institution

The University of Sydney

Subjects

AI
3D Segmentation
Computer Vision
Multi-modal Learning