Visual Scene Understanding through Scene Graph Generation and Joint Learning
Access status:
Open Access
Type
Thesis
Thesis type
Masters by Research
Author/s
Bui, Anh Duc
Abstract
Deep visual scene understanding is essential for the development of high-level visual understanding tasks such as storytelling or Visual Question Answering. One proposed solution for such purposes is the scene graph, which represents the semantic details of an image as abstract elements in a graph structure suitable both for machine processing and for human understanding. However, automatically generating reasonable and informative scene graphs remains a challenge due to the long-tail biases present in the available annotated data. This thesis therefore focuses on generating scene graphs from images for visual understanding in two main aspects: how scene graphs can be generated with object predicates that are both reasonable to human understanding and informative enough for downstream computer vision use, and how joint learning can be applied in the scene graph generation pipeline to further improve the quality of the output scene graph. For the first aspect, we address the problem in scene graph generation where uncorrelated labels are classified against each other; we tackle this by categorising correlated labels and learning category-specific predicate features. For the second aspect, a shuffle transformer is proposed to jointly learn the category-specific features and produce a more robust and informative universal predicate feature, which is then used to predict better predicate labels for the scene graph. The performance of the proposed methodology is evaluated against state-of-the-art scene graph generation methods using the mean recall metric on the subset of Visual Genome most commonly used for scene graph generation.
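For reference, the mean recall metric (mR@K) mentioned in the abstract averages recall@K over predicate classes, so rare tail predicates count as much as frequent head predicates. The sketch below is a simplified illustration assuming exact triplet matching; real scene graph evaluation also requires the predicted subject and object boxes to overlap the ground-truth boxes above an IoU threshold, which is omitted here. Function and variable names are illustrative and not taken from the thesis.

```python
from collections import defaultdict

def mean_recall_at_k(gt_triplets, pred_triplets, k, predicate_classes):
    """Simplified mean recall@K: the unweighted average of per-predicate recall@K.

    gt_triplets:   per-image lists of ground-truth (subject, predicate, object) triplets
    pred_triplets: per-image lists of predicted triplets, ranked best-first
    """
    hits = defaultdict(int)    # ground-truth triplets recovered in the top-K, per predicate
    totals = defaultdict(int)  # ground-truth triplet counts, per predicate

    for gt, preds in zip(gt_triplets, pred_triplets):
        top_k = set(preds[:k])              # keep only the K highest-ranked predictions
        for triplet in gt:
            _, predicate, _ = triplet
            totals[predicate] += 1
            if triplet in top_k:            # exact-match recovery (no box IoU check here)
                hits[predicate] += 1

    # Recall per predicate class, then an unweighted mean over classes,
    # so head and tail predicates contribute equally to the score.
    recalls = [hits[p] / totals[p] for p in predicate_classes if totals[p] > 0]
    return sum(recalls) / max(len(recalls), 1)
```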
Date
2023
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering, School of Computer Science
Awarding institution
University of Sydney