Research conducted during my time as a Ph.D. student at Durham University, UK.

Amir Atapour Research Project
Exemplar results of the proposed approach. ED: Estimated Depth.

Monocular Segment-Wise Depth: Monocular Depth Estimation based on a Semantic Segmentation Prior

This work proposes a monocular depth estimation approach that employs a jointly-trained pixel-wise semantic understanding step to estimate depth for individually-selected groups of objects (segments) within the scene. Given a single RGB image, the model first semantically parses the scene and then uses this knowledge to generate depth for carefully selected segments, i.e. groups of scene objects. These segment-wise depth images are subsequently fused by means of a simple summation operation, and the overall consistency is controlled by means of an adversarial training procedure as the final step. This decomposition creates simpler learning objectives for the jointly-trained individual networks, leading to more accurate overall depth. Such a training process is enabled by the use of a large-scale publicly-available dataset of synthetic images, SYNTHIA, which contains pixel-wise ground truth semantic object labels as well as pixel-perfect synthetic depth. Extensive experimentation demonstrates the efficacy of the proposed approach compared to contemporary state-of-the-art techniques within the literature.

Proposed Approach

The approach is designed to estimate depth for separate object groups that cover the entire scene when put together. Based on empirical analysis, we opt for decomposing any scene captured within an urban driving scenario into four object groups:

  • small and narrow foreground objects (e.g., pedestrians, road signs, cars)
  • flat surfaces (e.g., roads, buildings)
  • vegetation (e.g., trees, bushes)
  • background objects forming the remainder of the scene (other, often unlabelled, objects, e.g., a bench on the pavement).
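As a rough illustration, the four groups above can be expressed as a mapping from semantic class labels to object groups. The specific class names and their assignment below are assumptions for illustration only; the paper selects the groups empirically.

```python
# Hypothetical partition of urban-scene semantic classes into the four
# object groups described above (class names are illustrative, not the
# exact SYNTHIA label set).
SEGMENT_GROUPS = {
    "foreground": ["pedestrian", "sign", "pole", "car", "cyclist"],
    "flat": ["road", "sidewalk", "building", "fence"],
    "vegetation": ["vegetation"],
    "background": ["sky"],  # plus any remaining, often unlabelled, objects
}

def group_of(class_name):
    """Return the object group a semantic class belongs to.

    Unknown classes fall into 'background', mirroring the catch-all
    group for the remainder of the scene.
    """
    for group, classes in SEGMENT_GROUPS.items():
        if class_name in classes:
            return group
    return "background"
```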
Given an input image, each group is segmented using a separate segmentation network. The resulting segment masks are used to select the sections of the RGB image that are passed as inputs to depth generators, producing depth information for each object group. The entire model is trained end to end as a single entity. Such a training process is made possible by the SYNTHIA dataset, in which both ground truth depth and pixel-wise segmentation labels are available for video sequences of urban driving scenarios.
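The pipeline above can be sketched in PyTorch as follows. This is a minimal sketch: the tiny convolutional stacks stand in for the actual segmentation and depth-generation architectures used in the paper, and the masking/fusion details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SegmentWiseDepth(nn.Module):
    """Sketch of the pipeline: one segmentation network and one depth
    generator per object group; partial depth maps are fused by summation.
    The network internals here are placeholders, not the paper's models."""

    def __init__(self, n_groups=4):
        super().__init__()
        def tiny_net():
            return nn.Sequential(
                nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                nn.Conv2d(8, 1, 3, padding=1),
            )
        self.segmenters = nn.ModuleList(tiny_net() for _ in range(n_groups))
        self.depth_nets = nn.ModuleList(tiny_net() for _ in range(n_groups))

    def forward(self, rgb):
        partial_depths = []
        for seg, gen in zip(self.segmenters, self.depth_nets):
            mask = torch.sigmoid(seg(rgb))        # soft mask for this object group
            depth = gen(rgb * mask)               # depth for the masked region only
            partial_depths.append(depth * mask)   # suppress depth outside the segment
        return sum(partial_depths)                # fuse partial depths by summation

model = SegmentWiseDepth()
fused_depth = model(torch.randn(1, 3, 64, 64))   # one depth map per input image
```

Because the whole module is a single `nn.Module`, gradients flow through both the segmentation and depth branches, matching the end-to-end training described above.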
The training and inference pipeline of the proposed approach.
For the segmentation networks, a simple and efficient fully-supervised training procedure is used. The RGB image is fed to all the segmentation networks, and each network outputs class labels for its specific object group. A sigmoid function along with binary cross-entropy is used as the loss function for each network.

Monocular depth estimation, on the other hand, is treated as a supervised image-to-image mapping problem, wherein an input RGB image is translated into a depth image using a simple reconstruction mechanism (minimising the Euclidean distance between the pixel values of the output and the ground truth depth), which forces the model to generate images that are structurally and contextually close to the ground truth.

Once depth is estimated for each individual segment, the final depth output is obtained by simply summing the partial depth images generated for each segment. With a simple linear operation such as this, there is always the possibility of undesirable artefacts, such as stitching effects, over-saturation of depth values and depth inhomogeneity, being introduced in the results. To prevent such issues, an adversarial loss component is introduced to ensure the overall consistency of the final depth image: the result of the summation operation is used as the input to a discriminator, along with the ground truth depth, and the gradients from this discriminator are used in the overall training of the entire model. For further details of the approach, please refer to the paper.
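The three loss terms described above can be sketched as follows. The loss weights, the discriminator architecture, and the use of mean-squared error as the "Euclidean distance" are assumptions for illustration; the exact formulation follows the paper.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits, target_mask):
    """Sigmoid + binary cross-entropy, applied per segmentation network."""
    return F.binary_cross_entropy_with_logits(logits, target_mask)

def reconstruction_loss(pred_depth, gt_depth):
    """Euclidean-distance reconstruction term between predicted and
    ground truth depth (sketched here as per-pixel mean-squared error)."""
    return F.mse_loss(pred_depth, gt_depth)

def adversarial_loss(discriminator, fused_depth):
    """The fused (summed) depth map should be scored as 'real' by a
    discriminator that also sees ground truth depth during training.
    `discriminator` is a placeholder callable, not the paper's model."""
    scores = discriminator(fused_depth)
    return F.binary_cross_entropy_with_logits(scores, torch.ones_like(scores))
```

During training, the total generator objective would combine these terms (with weights chosen empirically), while the discriminator is updated separately on real versus fused depth maps, in the usual adversarial fashion.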

Examples of the results of the proposed approach.

Technical Implementation

All implementation is carried out in Python using PyTorch and OpenCV. All training and experiments were performed on an NVIDIA GeForce GTX 1080 Ti GPU.

Supplementary Video

For further details of the proposed approach and the results, please watch the video created as part of the supplementary material for the paper.