Show simple item record

Field: Value (Language)
dc.contributor.author: Wu, Wenhao
dc.date.accessioned: 2025-02-11T22:24:37Z
dc.date.available: 2025-02-11T22:24:37Z
dc.date.issued: 2025 (en_AU)
dc.identifier.uri: https://hdl.handle.net/2123/33613
dc.description.abstract: Video understanding is a fundamental area of computer vision, with applications in autonomous driving, security, healthcare, and entertainment. As video becomes a dominant medium for information exchange, automatic interpretation is increasingly essential. While action recognition has long been the foundation of video understanding, recent multimodal advances have expanded the field to tasks such as video-text matching and video question answering. This thesis explores key advances across multiple dimensions. First, it optimizes video recognition through salient frame selection, introducing the Non-saliency Suppression Network (NSNet) to improve both efficiency and accuracy. It then investigates video model backbones, proposing the Arithmetic Temporal Module (ATM), a plug-and-play component for temporal modeling compatible with both CNNs and vision transformers. In self-supervised video representation learning, the Macro-to-Micro Semantic Correspondence (MaMiCo) task improves representation learning in the absence of labeled data. Moving toward weakly supervised learning, the Text4Vis framework adapts vision-language models to video recognition by leveraging text embeddings as classifiers, strengthening zero-shot and few-shot recognition. To incorporate real-world textual metadata, Cap4Video demonstrates that auxiliary captions improve both text-video retrieval and recognition. Lastly, addressing the gap in video-based multimodal large language models (MLLMs), this thesis introduces the Dense Connector, a plug-and-play module that enhances vision-language integration, and FreeVA, a training-free extension of image-based MLLMs to video, achieving state-of-the-art performance. Together, these contributions advance video understanding and provide insights and practical approaches for future research and applications. (en_AU)
dc.language.iso: en (en_AU)
dc.subject: Video Understanding (en_AU)
dc.subject: Multimodal Learning (en_AU)
dc.subject: Video Representation Learning (en_AU)
dc.subject: Action Recognition (en_AU)
dc.subject: Temporal Modeling (en_AU)
dc.subject: Cross-Modal Learning (en_AU)
dc.title: A Comprehensive Exploration of Video Understanding: Perspectives on Sampling, Backbone, Representation, and Cross-Modal Learning (en_AU)
dc.type: Thesis
dc.type.thesis: Doctor of Philosophy (en_AU)
dc.rights.other: The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. (en_AU)
usyd.faculty: SeS faculties schools::Faculty of Engineering::School of Computer Science (en_AU)
usyd.degree: Doctor of Philosophy Ph.D. (en_AU)
usyd.awardinginst: The University of Sydney (en_AU)
usyd.advisor: Xu, Chang
usyd.include.pub: No (en_AU)


Associated file/s

Associated collections

There are no previous versions of the item available.