Data-efficient Visual Recognition and Localization
| Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Wei, Fangyun | |
| dc.date.accessioned | 2026-03-29T21:33:51Z | |
| dc.date.available | 2026-03-29T21:33:51Z | |
| dc.date.issued | 2026 | en |
| dc.identifier.uri | https://hdl.handle.net/2123/35039 | |
| dc.description.abstract | Over the past decade, computer vision has progressed from visual understanding to visual generation. Visual understanding extracts semantic and geometric information from images and videos and supports applications such as autonomous driving, robotics, medical imaging, surveillance, and augmented reality. Object detection and segmentation are foundational visual understanding tasks. Detection localizes and classifies objects, while segmentation provides pixel-level predictions, including semantic and instance segmentation. Extending these tasks to videos introduces challenges such as motion blur, occlusion, appearance variation, and temporal consistency. In particular, video instance segmentation requires detecting, segmenting, and tracking objects across frames. These tasks provide structured representations that serve as reusable building blocks for downstream reasoning and decision-making. Despite advances in model capacity, modern vision systems rely heavily on large-scale labeled data. However, acquiring high-quality annotations is costly and labor-intensive, especially for dense video masks or domain-specific scenarios. Long-tail distributions and domain shifts further increase data collection difficulty. To address this limitation, data-efficient learning aims to achieve strong performance with limited labeled data by leveraging unlabeled data through semi-supervised learning, pseudo-labeling, consistency regularization, and pretraining. In this thesis, we study data-efficient learning for visual understanding and validate its effectiveness on object detection, video object segmentation, and video instance segmentation. We develop practical frameworks that reduce annotation dependency while maintaining competitive accuracy, advancing scalable vision systems under realistic data constraints. | en |
| dc.language.iso | en | en |
| dc.subject | data-efficient learning | en |
| dc.subject | computer vision | en |
| dc.subject | object detection | en |
| dc.subject | video instance segmentation | en |
| dc.subject | video object segmentation | en |
| dc.title | Data-efficient Visual Recognition and Localization | en |
| dc.type | Thesis | |
| dc.type.thesis | Doctor of Philosophy | en |
| dc.rights.other | The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission. | en |
| usyd.faculty | SeS faculties schools::Faculty of Engineering::School of Computer Science | en |
| usyd.degree | Doctor of Philosophy Ph.D. | en |
| usyd.awardinginst | The University of Sydney | en |
| usyd.advisor | Xu, Chang | |
| usyd.include.pub | No | en |