
1 Introduction

Semantic segmentation has recently gained significant attention in the field of computer vision. One of its main applications is autonomous driving, where the vehicle understands the environment by assigning a class to each pixel in the scene and can consequently react accordingly [14]. In this work, we investigate the use of geometric cues to improve the accuracy of semantic segmentation.

Most semantic segmentation algorithms rely mainly on appearance cues and do not exploit geometry-related information. In this paper, we investigate the use of depth as a geometric cue for the semantic segmentation task in the automated driving application, where the scene has a strong geometric structure: the road surface is typically flat and objects stand vertically on it. This structure is exploited explicitly in the formulation of a commonly used depth representation, namely Stixels [6]. The contributions of this work include:

  1. A detailed study of the impact of depth for segmentation in automated driving.

  2. A systematic study of fusing RGB and depth for semantic segmentation using four CNN networks.

  3. Experimentation on two automotive datasets, namely Virtual KITTI and Cityscapes.

The rest of the paper is organized as follows: Sect. 2 reviews the related work on segmentation, depth computation, and the role of depth in semantic segmentation. Section 3 details the four architectures used to systematically study the effect of fusing depth with appearance for semantic segmentation. Section 4 discusses the experimental results on Virtual KITTI and Cityscapes. Finally, Sect. 5 provides concluding remarks.

2 Related Work

2.1 Semantic Segmentation

Siam et al. [25] presented a detailed survey on automated driving, particularly for semantic segmentation. The progress of semantic segmentation to date can be discussed in three phases. It started with patch-wise training for classification, as reported in [8], which proposed multi-scale pyramid processing through a 3-stage network followed by a classical segmentation approach as post-processing. Grangier et al. [11] proposed a pixel-level classification approach using a deep network to avoid post-processing, but it did not eliminate patch-wise training.

The next phase of progress was pixel-wise classification through end-to-end learning, as reported in [1, 18, 22]. The fully convolutional network (FCN) [18] was the first deep learning based technique that did not use patch-wise training; instead, it learned directly from the heatmaps, and a series of upsampling layers was used to obtain dense predictions. Later, a deconvolution layer was proposed in SegNet [1] in place of the unpooling layer. The introduction of skip connections from encoder to decoder for output reconstruction was another contribution of this line of work.

Recently, feature extraction from multi-scale inputs has been heavily explored [4, 8, 22, 23, 24, 31]. Although [8] used encoder feature maps via skip connections to merge heatmaps from different resolutions, the spatial reduction on the encoder side hurt the final prediction. U-Net [24] takes encoded feature maps from the initial layers, concatenates them with the decoded feature maps, and upsamples them for the next layers. To avoid loss of resolution, broadening the receptive field with dilated convolutions has shown better results.

2.2 Depth in Automated Driving Systems

Depth estimation is critical for automated driving, as image semantics without localization is seldom useful. In a typical automated driving pipeline, depth is already computed and can be leveraged for semantic segmentation. In this sub-section, we summarize the different mechanisms by which depth can be estimated.

Classical Geometric Approach. Dense depth is computed to understand the spatial geometry of the scene. Stereo cameras are commonly used in front-camera automated driving systems, and disparity estimation using classical geometric matching algorithms is quite mature. Alternatively, Structure from Motion (SFM) approaches can be used for monocular cameras, but they suffer from issues such as handling moving objects and the focus of expansion. Accurate depth could be useful for semantic segmentation and could be passed on as an extra channel. However, SFM estimates are quite noisy, and variations of the algorithm over time could affect the training of the network. Nevertheless, in [2] cues inferred from the noisy point cloud were used as features for segmentation: height above the camera, distance to the camera path, projected surface orientation, feature track density, and residual reconstruction error. The work in [16] proposed a way of jointly estimating semantic segmentation and structure from motion in a conditional random field formulation.

CNN-Based Depth Estimation. In recent years, several CNN-based monocular depth estimation approaches have been trained in a supervised manner; they require only a single input image and make no assumptions about the scene geometry or the types of objects present. For the autonomous driving application, unsupervised methods are very beneficial due to the lack of reliable annotated datasets providing depth maps for outdoor driving scenes. Unsupervised depth estimation remains an open research topic: [32] used the temporal information of video sequences to capture depth, while the approach of [10], referred to as "monoDepth", used left-right consistency of stereo images to train the network, with depth estimated from monocular images at inference time. We exploit the latter approach to generate depth maps for both the Virtual KITTI and Cityscapes datasets in our experiments.
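To make the left-right consistency idea concrete, the sketch below shows a simplified photometric reconstruction term, assuming PyTorch (our choice of framework; the SSIM component, multi-scale disparities, and the explicit left-right disparity consistency penalty of [10] are omitted for brevity, and all function names here are ours):

```python
import torch
import torch.nn.functional as F

def warp_with_disparity(right, disp_left):
    """Reconstruct the left image by warping the right image horizontally.

    right:     (B, C, H, W) right stereo image
    disp_left: (B, 1, H, W) predicted left disparity, as a fraction of image width
    """
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=right.device),
        torch.linspace(-1, 1, w, device=right.device),
        indexing="ij",
    )
    # Shift x-coordinates by the disparity (x spans 2 units in [-1, 1]).
    xs = xs.unsqueeze(0) - 2.0 * disp_left.squeeze(1)
    grid = torch.stack((xs, ys.unsqueeze(0).expand_as(xs)), dim=-1)
    return F.grid_sample(right, grid, align_corners=True)

def photometric_loss(left, right, disp_left):
    """L1 reconstruction error between the left image and its warped estimate."""
    return torch.mean(torch.abs(left - warp_with_disparity(right, disp_left)))
```

At training time the network sees both stereo views through this loss, yet at inference only a single monocular image is needed to predict the disparity.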

LIDAR Sensors. LIDAR sensors provide depth estimates with better accuracy and range than camera-based estimation algorithms. However, their measurements are sparse in the image lattice, as illustrated in Fig. 1. This makes it difficult to learn dense convolutional neural network features directly and requires explicit handling of the sparsity [28]. LIDAR measurements can, however, be fused with camera-based dense depth; the method in [21] fused sparse LIDAR for semantic segmentation using elastic fusion [30]. In general, this is a good research problem to pursue, as LIDAR is becoming a standard sensor in automated driving systems.

Fig. 1.

Visualization of depth estimation in automated driving scenes (top), adapted from [20]. It illustrates the output of SGM, a commonly used depth estimation algorithm that we use in this paper, and of CNN-based depth estimation, which is closer to ground truth. Velodyne LIDAR depth re-projected onto a wide-angle image frame (bottom) illustrates the level of sparsity.

2.3 Usage of Depth in Semantic Segmentation

FuseNet [12] is quite close to the work in this paper; its authors show that concatenating RGB and depth slightly degrades mean IoU, while the two-stream approach improves mean IoU by 3.65% on the SUN RGB-D dataset. Ma et al. [19] combine depth and RGB for multi-view semantic segmentation, where depth is leveraged to re-warp different views. Lin et al. [17] use an FCN-based cascaded feature network with branch predictors and show an improvement of 2% in IoU over an RGB baseline on the NYU dataset. A detailed empirical study of the role of depth in semantic segmentation and object detection was done in [3], showing a 2% IoU improvement on the VOC2012 dataset. Weiyue et al. [29] incorporate a depth-aware architecture design and obtain a larger improvement of 10% IoU on the NYU dataset.

Apart from color, depth is another dimension, and its influence on the semantic segmentation task is relatively less explored. The above-mentioned works that use RGB-D cameras mainly focus on indoor scenes. Automotive scenes, by contrast, are very challenging for semantic segmentation due to varied road conditions, diverse lighting, and dense shadows; however, their stronger geometric structure is something that can be exploited. From our extensive literature study, it appears that no systematic study has been done on the influence of depth for automotive scenes, and this motivated our work.

3 Semantic Segmentation Models

In this section, the four architectures used in this paper are described. Figure 2(c) shows the RGBD network, which concatenates the RGB image and the depth map into a four-channel input. Figure 2(d) shows the two-stream RGB+D network. The RGB-only and Depth-only networks, shown in Fig. 2(a) and (b), are used as baselines for comparison.

Fig. 2.

Four types of architectures constructed and tested in this paper. (a) and (b) are baselines using RGB only and depth only. (c) and (d) are depth-augmented semantic segmentation architectures.

Table 1. Quantitative analysis of our four networks on the Virtual KITTI dataset.
Table 2. Semantic segmentation results (mean IoU) on the Virtual KITTI dataset (GT: ground truth, mD: monoDepth).
Table 3. Semantic segmentation results (mean IoU) on the Cityscapes dataset (SGM: semi-global matching, mD: monoDepth).

3.1 One-Stream Networks

This network is based on the FCN8s [18] architecture and is used in our RGB-only and Depth-only experiments. The fully connected layers of VGG16 are converted to convolutional layers, and the first 15 convolutional layers are used for feature extraction. The segmentation decoder follows the FCN architecture: a \(1 \times \) 1 convolutional layer is followed by three transposed convolution layers for up-sampling. Skip connections within the encoder were not tried, as residual learning is not very effective for smaller networks, as shown in [7]. Skip connections from encoder to decoder are exploited to extract high-resolution features from the lower layers, which are added to the upsampled feature maps; a sketch of this decoder is given below. The loss function used for semantic segmentation is then defined in Eq. (1).
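The decoder described above can be sketched as follows, assuming PyTorch and the usual VGG16 stage widths (this is our reconstruction from the text, not the authors' released code; the converted fully connected layers are folded into the final feature stage for brevity):

```python
import torch
import torch.nn as nn

class FCN8sDecoder(nn.Module):
    """FCN8s-style decoder sketch: 1x1 score layers, three transposed
    convolutions for up-sampling, and additive encoder-to-decoder skips."""

    def __init__(self, n_classes):
        super().__init__()
        self.score5 = nn.Conv2d(512, n_classes, 1)  # 1x1 score layers
        self.score4 = nn.Conv2d(512, n_classes, 1)
        self.score3 = nn.Conv2d(256, n_classes, 1)
        self.up2a = nn.ConvTranspose2d(n_classes, n_classes, 4, stride=2, padding=1)
        self.up2b = nn.ConvTranspose2d(n_classes, n_classes, 4, stride=2, padding=1)
        self.up8 = nn.ConvTranspose2d(n_classes, n_classes, 16, stride=8, padding=4)

    def forward(self, pool3, pool4, pool5):
        x = self.up2a(self.score5(pool5))  # 1/32 -> 1/16 resolution
        x = x + self.score4(pool4)         # skip connection (addition)
        x = self.up2b(x)                   # 1/16 -> 1/8
        x = x + self.score3(pool3)         # skip connection (addition)
        return self.up8(x)                 # 1/8 -> full resolution
```

The three transposed convolutions realize the 2x, 2x, and 8x up-sampling stages of FCN8s, and the two additions are the encoder-to-decoder skip connections mentioned above.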

$$\begin{aligned} L= - \frac{1}{|I|} \sum _{i \in I} \sum _{c \in C_{Dataset}} p_i(c)\log {q_i(c)} \end{aligned}$$
(1)

where \(q\) denotes the predictions and \(p\) the one-hot ground truth, \(C_{Dataset}\) is the set of classes of the dataset used, and \(I\) is the set of image pixels.
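In a framework such as PyTorch (our assumption; the paper does not name an implementation), Eq. (1) reduces to the standard pixel-averaged cross-entropy:

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Pixel-wise cross-entropy of Eq. (1).

    logits: (B, C, H, W) raw decoder outputs, with C = |C_Dataset| classes
    labels: (B, H, W) integer ground-truth class indices
    cross_entropy applies a log-softmax to obtain log q_i(c); the one-hot
    ground truth p_i(c) selects the true class, and the default 'mean'
    reduction realizes the 1/|I| average over all pixels.
    """
    return F.cross_entropy(logits, labels)
```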

3.2 RGBD Network

The input to the network has four channels: the original RGB image concatenated with the depth map, where the depth layer is normalized to the range 0 to 255 to match the value range of RGB. The VGG pretrained weights are utilized; however, the first layer is changed to accept a four-channel input, with the weights corresponding to the extra channel initialized randomly. Ground-truth depth maps are used in the case of Virtual KITTI to eliminate errors due to depth estimation algorithms. For Cityscapes, we use disparity maps computed with the SGM algorithm [13], a commonly used depth estimation method in automated driving.
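A minimal sketch of this four-channel adaptation, assuming PyTorch and torchvision's VGG16 (the paper does not specify a framework, and `make_rgbd_vgg16` is a name of our choosing):

```python
import torch
import torch.nn as nn
from torchvision import models

def make_rgbd_vgg16():
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    old = vgg.features[0]  # Conv2d(3, 64, kernel_size=3, padding=1)
    new = nn.Conv2d(4, old.out_channels, kernel_size=old.kernel_size,
                    stride=old.stride, padding=old.padding)
    with torch.no_grad():
        new.weight[:, :3] = old.weight                # keep pretrained RGB filters
        nn.init.normal_(new.weight[:, 3:], std=0.01)  # random init for depth channel
        new.bias.copy_(old.bias)
    vgg.features[0] = new
    return vgg
```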

3.3 Two Stream (RGB+D) Network

Inspired by [15, 26, 27], a two-stream network with two VGG16 encoders is used, where each encoder processes a different input: one takes the RGB image and the other the depth map. Feature maps from the two encoders are fused using two approaches. The first uses a summation junction (RGB+D Add), while the other uses concatenation instead of summation (RGB+D concat). Concatenation doubles the depth dimension of the feature vector; however, we aim to give the network more flexibility to learn a more complex fusion and improve results. Afterwards, the same decoder as in the one-stream network is used for upsampling.
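The two fusion variants can be sketched as follows (a hypothetical PyTorch fragment; whether the doubled channels after concatenation are reduced by the decoder's own \(1 \times \) 1 convolution or by a dedicated layer is an implementation detail, shown here as a dedicated \(1 \times \) 1 reduction):

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Fuse feature maps from an RGB encoder and a depth encoder."""

    def __init__(self, enc_rgb, enc_depth, mode="concat", channels=512):
        super().__init__()
        self.enc_rgb, self.enc_depth, self.mode = enc_rgb, enc_depth, mode
        if mode == "concat":
            # Concatenation doubles the channel count; a 1x1 convolution lets
            # the network learn the fusion and restores the width expected
            # by the shared decoder.
            self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb, depth):
        f_rgb, f_d = self.enc_rgb(rgb), self.enc_depth(depth)
        if self.mode == "add":  # RGB+D Add: summation junction
            return f_rgb + f_d
        return self.reduce(torch.cat([f_rgb, f_d], dim=1))  # RGB+D concat
```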

4 Experiments

In this section, we present the datasets used, the experimental setup, and the results.

4.1 Datasets

We chose two datasets, namely Virtual KITTI [9] and Cityscapes [5], since they contain outdoor road scenes, which is consistent with our focus on automated driving. Additionally, Virtual KITTI provides perfect ground-truth depth annotation, which lets us evaluate our algorithm excluding depth estimation errors. In automotive applications, depth is typically computed from stereo images or via structure from motion; we utilize the Cityscapes depth maps, which are computed from stereo images using the SGM algorithm. Virtual KITTI is a synthetic dataset of 21,260 frames containing urban road scenes under different weather conditions; we exploit both its depth and semantic segmentation annotations. Cityscapes is a well-known dataset of real road-scene images, consisting of 20,000 images with coarse semantic segmentation annotation and 5,000 with fine annotation. We use only the fine annotations in our experiments and intentionally use the noisy SGM depth to understand the effect of relatively noisy depth estimates. We evaluate using the IoU metric on the validation set, which contains 500 frames.

4.2 Experimental Setup

We used the Virtual KITTI and Cityscapes datasets, where the image dimensions are \(375 \times 1242\) and \(1024 \times 2048\) (the latter down-scaled to \(512 \times 1024\) during training), respectively. For all experiments, we transferred the encoder weights of the VGG model pre-trained on ImageNet to the segmentation task. Transfer learning gave us a better initialization of the encoder at the beginning of the joint encoder-decoder training for semantic segmentation. Dropout with probability 0.5 is used in our model, particularly for the \(1 \times 1\) convolutional layers. The Adam optimizer is used with an initial learning rate of \(10^{-5}\), along with L2 regularization in the loss function with a factor of \(5 \times 10^{-4}\) to avoid over-fitting. To evaluate the efficacy of our proposal, the widely used Intersection over Union (IoU) metric is measured for both datasets; precision, recall, and F-score are additionally reported for Virtual KITTI.
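Under the stated hyper-parameters, the training configuration might look as follows (a sketch under the assumption of PyTorch; `weight_decay` realizes the L2 regularization term):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 1)  # placeholder; substitute any of the four networks of Sect. 3

# Adam with the stated initial learning rate; weight_decay is the L2 factor.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=5e-4)

# Dropout with p = 0.5, applied in the model around the 1x1 (score) layers.
drop = nn.Dropout2d(p=0.5)
```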

4.3 Experimental Results

We provide quantitative evaluation on both datasets using the IoU metric, as shown in Tables 1, 2 and 3. Video links to the results of the four architectures are also provided for both datasets. In addition to the depth annotations provided with the datasets, we generated depth maps using the unsupervised approach of [10] for both datasets and compared the results against the ground truth for Virtual KITTI and the noisy SGM depth for Cityscapes.

Fig. 3.

Qualitative comparison of semantic segmentation outputs from the four architectures on the Virtual KITTI dataset using ground-truth depth

Fig. 4.

Qualitative comparison of semantic segmentation outputs using depth from the monoDepth [10] estimator

Table 1 shows that depth augmentation consistently improves results across all four reported metrics: an improvement of 5.7% in IoU, 3.8% in precision, 4.36% in recall, and 5.9% in F-score. Class-wise evaluation is listed in Table 2. Although the overall improvement is incremental, there is a large improvement for certain classes; for example, truck, van, building, and traffic light improve by 32%, 28%, 9%, and 8%, respectively. Cityscapes results are reported in Table 3 and show a relatively moderate improvement of 1% in IoU. Nevertheless, these results show that even noisy depth maps containing invalid values due to depth estimation errors can improve semantic segmentation.

Fig. 5.

Qualitative comparison of semantic segmentation outputs from the four architectures on the Cityscapes dataset using the SGM depth estimator

Fig. 6.

Qualitative comparison of semantic segmentation outputs on the Cityscapes dataset using depth from the monoDepth estimator

In summary, the network that concatenates depth with RGB feature maps shows better results than the others on Virtual KITTI. On Cityscapes, the impact of feature-map concatenation and addition is fairly close; however, with depth from the monoDepth [10] estimator, concatenated feature maps outperform added feature maps. Qualitative results of all four proposals are shown for Virtual KITTI (Figs. 3 and 4) and Cityscapes (Figs. 5 and 6).

Test results for both datasets are shared publicly on YouTube (see Footnotes 1 and 2). The Depth-only network is reported to study how well the depth cue alone can perform semantic segmentation. Surprisingly, depth alone provides good results, especially for road, vegetation, vehicles, and pedestrians. This is consistent with the results obtained by [12] when depth alone was tested on indoor scenes. We noticed a degradation of accuracy relative to the RGB baseline whenever the depth is noisy; hence, a next step would be a more systematic evaluation of depth that is loosely coupled within the network. We also observed that the joint network outperformed the depth-only network by only a negligible margin; perhaps the network does not really know how to learn these two quite different cues, and thus the two modalities are not effectively fused. Our future plan includes the construction of multi-modal architectures to achieve a better amalgamation of heterogeneous cues.

5 Conclusion

In this paper, we focused on the impact of depth, a relatively unexplored cue, on the semantic segmentation task. We designed four segmentation networks that receive as input RGB only, depth only, concatenated RGBD, and two-stream RGB and depth. Our experimental results with the four models on two automotive datasets, namely Virtual KITTI and Cityscapes, demonstrate a reasonable improvement in overall accuracy and a good improvement for a few specific classes with the network that uses simple depth augmentation. We believe the present study furnishes adequate evidence of the impact of depth on accurate semantic segmentation. In future work, we will build a more robust depth-aware model to fully utilize its complementary nature.