A Comprehensive Exploration of Video Understanding: Perspectives on Sampling, Backbone, Representation, and Cross-Modal Learning
Field | Value | Language |
dc.contributor.author | Wu, Wenhao | |
dc.date.accessioned | 2025-02-11T22:24:37Z | |
dc.date.available | 2025-02-11T22:24:37Z | |
dc.date.issued | 2025 | en_AU |
dc.identifier.uri | https://hdl.handle.net/2123/33613 | |
dc.description.abstract | Video understanding is a fundamental area of computer vision with applications in autonomous driving, security, healthcare, and entertainment. As video becomes a dominant medium for information exchange, automatic video interpretation is increasingly essential. While action recognition has long been the foundation of video understanding, recent multimodal advances have expanded the field to tasks such as video-text matching and video question answering. This thesis advances video understanding along four dimensions: frame sampling, backbone design, representation learning, and cross-modal learning. First, it improves the efficiency and accuracy of video recognition through salient frame selection, introducing the Non-saliency Suppression Network (NSNet). It then investigates video model backbones, proposing the Arithmetic Temporal Module (ATM), a plug-and-play component for temporal modeling that is compatible with both CNNs and vision transformers. For self-supervised video representation learning, the Macro-to-Micro Semantic Correspondence (MaMiCo) pretext task improves representation quality in the absence of labeled data. Moving toward weakly supervised learning, the Text4Vis framework adapts pre-trained vision-language models to video recognition by using text embeddings as classifiers, strengthening zero-shot and few-shot recognition. To exploit real-world textual metadata, Cap4Video demonstrates that auxiliary captions improve both text-video retrieval and video recognition. Finally, addressing the gap in video-based multimodal large language models (MLLMs), the thesis introduces the Dense Connector, a plug-and-play module that strengthens vision-language integration, and FreeVA, a training-free extension of image-based MLLMs to video, achieving state-of-the-art performance. Together, these contributions advance video understanding and offer insights and practical approaches for future research and applications. | en_AU |
dc.language.iso | en | en_AU |
dc.subject | Video Understanding | en_AU |
dc.subject | Multimodal Learning | en_AU |
dc.subject | Video Representation Learning | en_AU |
dc.subject | Action Recognition | en_AU |
dc.subject | Temporal Modeling | en_AU |
dc.subject | Cross-Modal Learning | en_AU |
dc.title | A Comprehensive Exploration of Video Understanding: Perspectives on Sampling, Backbone, Representation, and Cross-Modal Learning | en_AU |
dc.type | Thesis | |
dc.type.thesis | Doctor of Philosophy | en_AU |
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en_AU |
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Computer Science | en_AU |
usyd.degree | Doctor of Philosophy Ph.D. | en_AU |
usyd.awardinginst | The University of Sydney | en_AU |
usyd.advisor | Xu, Chang | |
usyd.include.pub | No | en_AU |