Intelligent Multimedia Data Analysis and Processing

Wang, Heng

Access status: USyd Access

Field	Value	Language
dc.contributor.author	Wang, Heng
dc.date.accessioned	2024-05-14T05:40:04Z
dc.date.available	2024-05-14T05:40:04Z
dc.date.issued	2024	en
dc.identifier.uri	https://hdl.handle.net/2123/32554
dc.description	Includes publication
dc.description.abstract	Artificial intelligence (AI) is now a transformative technology in areas ranging from daily creative work to scientific discovery. With the surge in the amount of data and the development of AI-related techniques, AI-generated content (AIGC) and AI4Science are gaining increasing traction. In this thesis, we investigate the potential of deep learning-based methods in cross-modal generation and neuroscience, addressing key facets of intelligent multimedia data analysis and processing. The first part of this thesis spans two primary investigations: 3D dense captioning and visually guided sound generation. We first investigate 3D dense captioning, in which objects within 3D indoor scenes are detected and described in human language. Recognizing the complexity inherent in 3D environments, we enhance the spatial understanding of our Transformer-based encoder-decoder architecture by incorporating spatiality information into the attention-based encoder. In contrast to the well-established research on vision-and-language, vision-and-audio is an emerging field that has only recently received attention, owing to the complexity of audio signals. In particular, we address the open-domain vision-to-audio generation task, approaching it through the synergy of foundation models (FMs). In the second part of this thesis, we employ deep learning-based techniques for the challenging task of 3D single neuron segmentation in neuroscience from two perspectives: architectural optimization and efficient utilization of limited datasets through representation learning. We first design graph-based information reasoning modules to jointly consider the local appearance and the global structures. We then propose a novel voxel-wise cross-volume SimSiam representation learning strategy, improving learning performance while maintaining the overall model architecture. Such developments will enable large-scale data-driven investigations in neuroscience and enhance our fundamental understanding of the human brain.	en
dc.language.iso	en	en
dc.subject	Multimodality AI	en
dc.subject	Generative AI	en
dc.subject	3D Dense Captioning	en
dc.subject	Vision-to-Audio Generation	en
dc.subject	3D Single Neuron Reconstruction	en
dc.subject	AIGC	en
dc.title	Intelligent Multimedia Data Analysis and Processing	en
dc.type	Thesis
dc.type.thesis	Doctor of Philosophy	en
dc.rights.other	The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.	en
usyd.faculty	SeS faculties schools::Faculty of Engineering::School of Computer Science	en
usyd.degree	Doctor of Philosophy Ph.D.	en
usyd.awardinginst	The University of Sydney	en
usyd.advisor	Cai, Weidong
usyd.include.pub	Yes	en