A Comprehensive Exploration of Video Understanding: Perspectives on Sampling, Backbone, Representation, and Cross-Modal Learning
Field | Value | Language |
dc.contributor.author | Wu, Wenhao | |
dc.date.accessioned | 2025-02-11T22:24:37Z | |
dc.date.available | 2025-02-11T22:24:37Z | |
dc.date.issued | 2025 | en_AU |
dc.identifier.uri | https://hdl.handle.net/2123/33613 | |
dc.description.abstract | Video understanding is a fundamental area of computer vision with applications in autonomous driving, security, healthcare, and entertainment. As video becomes a dominant medium for information exchange, automatic video interpretation is increasingly essential. While action recognition has long been the foundation of video understanding, recent multimodal advances have expanded the field to tasks such as video-text matching and video question answering. This thesis advances video understanding along four dimensions: frame sampling, backbone design, representation learning, and cross-modal learning. First, it improves the efficiency and accuracy of video recognition through salient frame selection, introducing the Non-saliency Suppression Network (NSNet). It then investigates video model backbones, proposing the Arithmetic Temporal Module (ATM), a plug-and-play component for temporal modeling that is compatible with both CNNs and vision transformers. For self-supervised video representation learning, the Macro-to-Micro Semantic Correspondence (MaMiCo) pretext task improves representation quality in the absence of labeled data. Moving toward weakly supervised learning, the Text4Vis framework adapts pre-trained vision-language models to video recognition by using text embeddings as classifiers, strengthening zero-shot and few-shot recognition. To exploit real-world textual metadata, Cap4Video demonstrates that auxiliary captions improve both text-video retrieval and video recognition. Finally, addressing the gap in video-based multimodal large language models (MLLMs), the thesis introduces the Dense Connector, a plug-and-play module that strengthens vision-language integration, and FreeVA, a training-free extension of image-based MLLMs to video, achieving state-of-the-art performance. Together, these contributions advance video understanding and offer insights and practical approaches for future research and applications. | en_AU |
dc.language.iso | en | en_AU |
dc.subject | Video Understanding | en_AU |
dc.subject | Multimodal Learning | en_AU |
dc.subject | Video Representation Learning | en_AU |
dc.subject | Action Recognition | en_AU |
dc.subject | Temporal Modeling | en_AU |
dc.subject | Cross-Modal Learning | en_AU |
dc.title | A Comprehensive Exploration of Video Understanding: Perspectives on Sampling, Backbone, Representation, and Cross-Modal Learning | en_AU |
dc.type | Thesis | |
dc.type.thesis | Doctor of Philosophy | en_AU |
dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en_AU |
usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Computer Science | en_AU |
usyd.degree | Doctor of Philosophy Ph.D. | en_AU |
usyd.awardinginst | The University of Sydney | en_AU |
usyd.advisor | Xu, Chang | |
usyd.include.pub | No | en_AU |