Intelligent Multimedia Data Analysis and Processing

Wang, Heng

Access status:

USyd Access

Type

Thesis

Thesis type

Doctor of Philosophy

Author/s

Wang, Heng

Abstract

Artificial intelligence (AI) is now a transformative technology in areas ranging from daily creative work to scientific discovery. With the surge in available data and the development of AI-related techniques, AI-generated content (AIGC) and AI4Science are gaining increasing traction. In this thesis, we investigate the potential of deep learning based methods in cross-modal generation and neuroscience, addressing key facets of intelligent multimedia data analysis and processing. The first part of this thesis covers two primary investigations: 3D dense captioning and visually-guided sound generation. We first investigate 3D dense captioning, in which objects within 3D indoor scenes are detected and described in natural language. Recognizing the complexity inherent in 3D environments, we enhance the spatial understanding of our Transformer-based encoder-decoder architecture by incorporating spatiality information into the attention-based encoder. In contrast to the well-established research on vision-and-language, vision-and-audio is an emerging field that has only recently received attention, owing to the complexity of audio signals. In particular, we address the open-domain vision-to-audio generation task, approaching it through the synergy of foundation models (FMs). In the second part of this thesis, we employ deep learning based techniques for the challenging task of 3D single neuron segmentation in neuroscience from two perspectives: architectural optimization and efficient utilization of limited datasets through representation learning. We first design graph-based information reasoning modules to jointly consider local appearance and global structures. We then propose a novel voxel-wise cross-volume SimSiam representation learning strategy, improving learning performance while maintaining the overall model architecture. These developments will enable large-scale data-driven investigations in neuroscience and enhance our fundamental understanding of the human brain.

Date

2024

Rights statement

The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.

Faculty/School

Faculty of Engineering, School of Computer Science

Awarding institution

The University of Sydney

Subjects

Multimodality AI
Generative AI
3D Dense Captioning
Vision-to-Audio Generation
3D Single Neuron Reconstruction
AIGC