Dense prediction, or pixel-level labeling, aims to predict a label of interest (e.g., a category, depth value, flow vector, or edge probability) for each pixel of an input image. This mid-level computer vision problem plays a crucial role in building visual perception systems for the future intelligent world. Tremendous effort has therefore been devoted over the past decades to exploring solutions for robust dense prediction, and recent studies have made continuous and significant progress by relying on deep Fully Convolutional Networks (FCNs). Depending on the expected label, dense prediction comprises a set of subtasks. Building a robust model for each task requires examining its particular properties, but the main intuitions and motivations behind network architecture development are shared across different tasks.
In this thesis, we take the well-known problems of scene parsing, monocular depth estimation, and edge detection as examples, and devise advanced, highly extensible techniques that address both the individual and the collective issues of robust dense prediction. For scene parsing, employing hierarchical convolutional features is essential to obtaining high-resolution and fine-grained predictions. Previous algorithms typically aggregate them via concatenation or linear combination, which cannot fully exploit the diversity of contextual information or the spatial inhomogeneity of a scene. We propose novel attention mechanisms, namely adaptive hierarchical feature aggregation (AHFA) and mixture-of-experts (MoE), which re-weight different levels of features at each spatial location according to the local structure and surrounding contextual information before aggregation. Existing works on depth estimation often overlook the strong inherent ordinal correlation of depth values, resulting in inferior performance. Motivated by this observation, we introduce a ranking mechanism for depth estimation by proposing an effective ordinal regression constraint. For edge detection, common approaches simply predict the boundary probability of each pixel individually from the receptive field centered at that pixel. In contrast, we propose modeling boundary structures or position-sensitive scores, which is more flexible because of the implied feature competition in the prediction at each spatial position.
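The per-location re-weighting idea behind AHFA can be illustrated with a minimal sketch. The function below is an assumed, simplified form (NumPy only, no learned gating network): it takes multi-level feature maps and per-location gating logits, applies a softmax over the level axis so that every spatial position gets its own mixing weights, and returns the weighted sum. The names `adaptive_feature_aggregation` and `gate_logits` are illustrative, not from the thesis.

```python
import numpy as np

def adaptive_feature_aggregation(features, gate_logits):
    """Sketch of per-location re-weighting of hierarchical features.

    features:    list of L arrays, each (H, W, C) -- feature maps from
                 different levels of the network, already resized to a
                 common resolution.
    gate_logits: (H, W, L) array -- attention logits predicted per
                 spatial location (here supplied directly; in practice
                 they would come from a small gating sub-network).
    Returns an (H, W, C) aggregated feature map.
    """
    # Softmax over the level axis yields location-specific mixing
    # weights, so each pixel chooses its own blend of feature levels.
    e = np.exp(gate_logits - gate_logits.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)            # (H, W, L)

    stacked = np.stack(features, axis=-1)            # (H, W, C, L)
    # Broadcast the weights over the channel axis and sum out levels.
    return (stacked * w[:, :, None, :]).sum(axis=-1)  # (H, W, C)
```

With uniform logits this reduces to plain averaging of the levels; peaked logits let a pixel commit almost entirely to one level, which is what distinguishes adaptive aggregation from fixed linear combination.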
We also study unsupervised domain mapping, which is of general applicability and enables a consolidated solution for dense prediction. Advanced unsupervised domain mapping approaches mainly rely on Generative Adversarial Networks (GANs) to make the prediction indistinguishable from reality (e.g., generated pseudo parsing vs. ground-truth parsing), and reduce the solution space with high-level constraints and assumptions to guarantee that an input and the corresponding output are paired up in a meaningful way in the absence of paired training samples. However, they overlook a special property of images: simple geometric transformations do not change an image's semantics. With this motivation, we propose enforcing geometry consistency as a constraint, and demonstrate that it largely eliminates unreasonable mappings and produces more reliable solutions.
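The geometry-consistency constraint can be sketched as follows. This is an assumed, simplified formulation (NumPy only, with the mapping and transform passed in as plain functions): it penalizes the discrepancy between transforming the output of a mapping and mapping the transformed input, for a semantics-preserving transform such as a horizontal flip. The full method in the thesis additionally involves adversarial training; only the consistency term is shown here, and the function name is illustrative.

```python
import numpy as np

def geometry_consistency_loss(f, x, transform):
    """Sketch of a geometry-consistency penalty.

    f:         the image-to-image mapping under training, treated here
               as a plain function on arrays.
    x:         an input image, as an (H, W) or (H, W, C) array.
    transform: a simple geometric transformation that preserves
               semantics (e.g., np.fliplr for a horizontal flip).

    The constraint: predicting then transforming should agree with
    transforming then predicting. Mappings that violate this are
    penalized, which prunes unreasonable solutions.
    """
    predict_then_transform = transform(f(x))
    transform_then_predict = f(transform(x))
    # L1 discrepancy between the two orders of operation.
    return np.abs(predict_then_transform - transform_then_predict).mean()
```

A pixel-wise mapping (e.g., a per-pixel intensity change) commutes with a flip and incurs zero penalty, while a mapping that depends on absolute left-to-right position does not; the loss therefore steers training toward geometrically consistent solutions.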