Abstract
In this paper, we explore augmenting semantic segmentation with depth maps to improve its performance, motivated by the geometric structure of automotive scenes. Depth is typically already computed in an automotive system for object localization and path planning, and can thus be leveraged for semantic segmentation. We construct two baseline networks, “RGB only” and “Depth only”, and investigate the impact of fusing both cues using two further networks, “RGBD concat” and “Two Stream RGB+D”. We evaluate these networks on two automotive datasets: Virtual KITTI, using synthetic depth, and Cityscapes, using a standard stereo depth estimation algorithm. Additionally, we evaluate our approach using the unsupervised monoDepth estimator [10]. The two-stream architecture achieves the best results, with an improvement of 5.7% IoU on Virtual KITTI and 1% IoU on Cityscapes. There is a large improvement for certain classes such as trucks, building, van and cars, which show an increase of 29%, 11%, 9% and 8% respectively on Virtual KITTI. Surprisingly, a CNN model is able to produce good semantic segmentation from depth images alone. The proposed network runs at 4 fps on a TitanX GPU (Maxwell architecture).
1 Introduction
Recently, semantic segmentation has gained considerable attention in the field of computer vision. One of its main applications is autonomous driving, where the car understands its environment by assigning a class to each pixel in the scene and can consequently react accordingly [14]. In this work, we investigate the use of geometric cues to improve the accuracy of semantic segmentation.
Most semantic segmentation algorithms rely mainly on appearance cues and do not exploit geometry-related information. In this paper, we investigate the use of depth as a geometric cue for semantic segmentation in autonomous driving, where scenes have a strong geometric structure: the road surface is typically flat and objects stand vertically on it. This structure is exploited explicitly in the formulation of a commonly used depth representation, namely Stixels [6]. The contributions of this work include:
1. Detailed study of the impact of depth for segmentation in automated driving.
2. Systematic study of fusing RGB and Depth on semantic segmentation using four CNN networks.
3. Experimentation on two automotive datasets namely Virtual KITTI and Cityscapes.
The rest of the paper is organized as follows: Sect. 2 reviews the related work in segmentation, depth computation and role of depth in semantic segmentation. Section 3 illustrates the details of our four architectures to systematically study the effect of fusing depth with appearance for semantic segmentation. Section 4 discusses the experimental results in Virtual KITTI and Cityscapes. Finally, Sect. 5 provides concluding remarks.
2 Related Work
2.1 Semantic Segmentation
Siam et al. [25] presented a detailed survey of semantic segmentation for automated driving. The progress of semantic segmentation to date can be discussed in three phases. It started with patch-wise training, as reported in [8] for classification. [8] proposed multi-scale pyramid processing through a 3-stage network, followed by a classical segmentation approach as post-processing. Grangier et al. [11] proposed a pixel-level classification approach using a deep network to avoid post-processing, but it could not remove patch-wise training.
The next level of progress was pixel-wise classification through end-to-end learning, as reported in [1, 18, 22]. The fully convolutional network (FCN) [18] was the first deep learning based technique that did not use patch-wise training; rather, it learned directly from the heatmaps. A series of upsampling layers was used to obtain the dense predictions. Later, a deconvolution layer was proposed in SegNet [1] in place of the unpooling layer. The introduction of skip connections from encoder to decoder for output reconstruction was another contribution of this work.
Recently, feature extraction from multi-scale input has been heavily explored, as can be found in [4, 8, 22, 23, 24, 31]. Although [8] merged heatmaps from different resolutions using encoder feature maps passed through skip connections, the spatial reduction on the encoder side hurt the final prediction. U-Net [24] pools encoded feature maps from the initial layers, which are concatenated with the decoded feature maps and upsampled for the next layers. To avoid loss of resolution, broadening the receptive field by applying dilated convolution has shown better results.
2.2 Depth in Automated Driving Systems
Depth estimation is critical for automated driving: image semantics without localization are seldom useful. In a typical automated driving pipeline, depth is already computed and can be leveraged for semantic segmentation. In this sub-section, we summarize the different mechanisms by which depth can be estimated.
Classical Geometric Approach. Dense depth is computed to understand the spatial geometry of the scene. Stereo cameras are commonly used in front-camera automated driving systems, and disparity estimation methods based on classical geometric matching are quite mature. Alternatively, Structure From Motion (SFM) approaches can be used with monocular cameras, but they suffer from issues such as handling moving objects and the focus of expansion. Accurate depth could be useful for semantic segmentation and could be passed to the network as an extra channel. However, SFM estimates are quite noisy, and variations of the algorithm over time could affect the training of the network. Nevertheless, in [2] cues inferred from the noisy point cloud were used as features for segmentation. The cues proposed were: height above the camera, distance to the camera path, projected surface orientation, feature track density, and residual reconstruction error. The work in [16] proposed jointly estimating semantic segmentation and structure from motion in a conditional random field formulation.
CNN Based Depth Estimation. In recent years, several CNN-based monocular depth estimation approaches have been proposed that are trained in a supervised way and require only a single input image, with no assumptions about the scene geometry or the types of objects present. For autonomous driving, unsupervised methods are very beneficial because reliable annotated datasets with depth maps for outdoor driving scenes are lacking. Unsupervised depth estimation is an open research problem. [32] used the temporal information of a video sequence to capture depth, while [10], referred to as “monoDepth”, used left-right consistency between stereo images to train the network, with depth estimated from monocular images at inference. We use this approach to generate depth maps for both the Virtual KITTI and Cityscapes datasets in our experiments.
LIDAR Sensors. LIDAR sensors provide depth estimates with better accuracy and range than camera-based estimation algorithms. However, their measurements are sparse in the image lattice, as illustrated in Fig. 1. This makes it difficult to learn dense convolutional neural network features directly and requires explicit handling of sparsity [28]. They can, however, be fused with camera-based dense depth. The method in [21] fused sparse LIDAR for semantic segmentation using elastic fusion [30]. Generally, this is a good research problem to pursue, as LIDAR is becoming a standard sensor in automated driving systems.
2.3 Usage of Depth in Semantic Segmentation
FuseNet [12] is quite close to the work in this paper. The authors show that concatenating RGB and depth slightly degrades mean IoU, while the two-stream approach improves mean IoU by 3.65% on the SUN RGB-D dataset. Ma et al. [19] combine depth and RGB for multi-view semantic segmentation, where depth is leveraged to re-warp different views. Lin et al. [17] use an FCN-based cascaded feature network with branch predictors and show an improvement of 2% in IoU over the RGB baseline on the NYU dataset. A detailed empirical study on the role of depth in semantic segmentation and object detection was done in [3], showing a 2% improvement in IoU on the VOC2012 dataset. Wang and Neumann [29] incorporate a depth-aware architecture design and obtain a larger improvement of 10% IoU on the NYU dataset.
Apart from color, depth is another dimension, and its influence on semantic segmentation is relatively unexplored. The works mentioned above use RGB-D cameras and focus mainly on indoor scenes. Automotive scenes, on the other hand, are very challenging for semantic segmentation because of varying road conditions, diverse lighting and dense shadows; however, their stronger geometric structure is something that can be exploited. From our literature study, it appears that no systematic study has been done on the influence of depth in automotive scenes, and this motivated our work.
3 Semantic Segmentation Models
In this section, the four architectures used in this paper are illustrated. Figure 2(c) shows the RGBD network, which takes the concatenation of the RGB image and the depth map as a four-channel input. Figure 2(d) shows the two-stream RGB+D network. The RGB-only and Depth-only networks are shown in Fig. 2(a) and (b), and are used as baselines for comparison.
3.1 One-Stream Networks
This network is based on the FCN8s [18] architecture and is used in our RGB-only and Depth-only experiments. The fully connected layers of VGG16 are converted to a fully convolutional network, where the first 15 convolutional layers are used for feature extraction. The segmentation decoder follows the FCN architecture: a \(1 \times 1\) convolutional layer is followed by three transposed convolution layers for up-sampling. Skip connections within the encoder were not tried, as residual learning is not very effective for smaller networks, as shown in [7]. Skip connections from encoder to decoder are used to extract high-resolution features from the lower layers, which are added to the upsampled feature maps. The loss function used for semantic segmentation is given below.
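A sketch of that loss, assuming it is the standard cross-entropy over the class set, which is the form consistent with the symbols defined next (the summation index \(x\) and any normalization over pixels are assumptions):

\[ L(p, q) = -\sum_{x \in C_{Dataset}} p(x) \log q(x) \]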
where q denotes predictions and p denotes ground-truth. \(C_{Dataset}\) is the set of classes for the used dataset.
3.2 RGBD Network
The input to the network has four channels: the original RGB image channels concatenated with the depth map, where the depth channel is normalized to the range 0 to 255 so that it has the same value range as RGB. The VGG pretrained weights are utilized; however, the first layer is changed to accept a four-channel input, and the corresponding weights are initialized randomly. For Virtual KITTI, the ground-truth depth map is used to eliminate errors due to depth estimation algorithms. For Cityscapes, we use disparity maps computed with the SGM algorithm [13], a commonly used depth estimation algorithm in automated driving.
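The input construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a linear min-max rescaling of depth to the range 0 to 255 (the text does not state which normalization is used), and the helper names `normalize_depth` and `concat_rgbd` are hypothetical.

```python
from typing import List

Pixel = List[float]          # one pixel: a list of channel values
Image = List[List[Pixel]]    # H x W x C, stored as nested lists


def normalize_depth(depth: List[List[float]]) -> List[List[float]]:
    """Linearly rescale a depth map to [0, 255] (assumed min-max scheme)."""
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant map: avoid division by zero
        return [[0.0 for _ in row] for row in depth]
    scale = 255.0 / (hi - lo)
    return [[(d - lo) * scale for d in row] for row in depth]


def concat_rgbd(rgb: Image, depth: List[List[float]]) -> Image:
    """Append the normalized depth value to each RGB pixel (H x W x 4)."""
    norm = normalize_depth(depth)
    return [
        [pixel + [norm[r][c]] for c, pixel in enumerate(row)]
        for r, row in enumerate(rgb)
    ]


# Toy 1 x 2 image with made-up depth values.
rgb = [[[255.0, 0.0, 0.0], [0.0, 255.0, 0.0]]]
depth = [[2.0, 10.0]]
rgbd = concat_rgbd(rgb, depth)
```

Each pixel of `rgbd` now carries four values, so the first convolutional layer must accept four input channels, as the text describes.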
3.3 Two Stream (RGB+D) Network
Inspired by [15, 26, 27], a two-stream network with two VGG16 encoders is used, where each encoder processes a different input: one the RGB image and the other the depth map. The feature maps from both encoders are fused using one of two approaches. The first uses a summation junction (RGB+D Add), while the other uses concatenation instead of summation (RGB+D concat). Concatenation doubles the channel (depth) dimension of the feature vector; however, it is intended to give the network more flexibility to learn a more complex fusion and thereby improve results. Afterwards, the same decoder as in the one-stream network is used for upsampling.
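The difference between the two fusion junctions can be illustrated on toy feature maps. The sketch below is an assumption-laden illustration rather than the network itself: feature maps are plain nested lists in H x W x C layout, and `fuse_add` and `fuse_concat` are hypothetical names.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # H x W x C nested lists


def fuse_add(rgb_feat: FeatureMap, depth_feat: FeatureMap) -> FeatureMap:
    """Summation junction: element-wise sum, output keeps C channels."""
    return [
        [[a + b for a, b in zip(p, q)] for p, q in zip(row_rgb, row_d)]
        for row_rgb, row_d in zip(rgb_feat, depth_feat)
    ]


def fuse_concat(rgb_feat: FeatureMap, depth_feat: FeatureMap) -> FeatureMap:
    """Concatenation junction: channels are stacked, output has 2C channels."""
    return [
        [p + q for p, q in zip(row_rgb, row_d)]
        for row_rgb, row_d in zip(rgb_feat, depth_feat)
    ]


# Toy 1 x 1 feature maps with C = 2 channels from each encoder.
rgb_feat = [[[1.0, 2.0]]]
depth_feat = [[[10.0, 20.0]]]

added = fuse_add(rgb_feat, depth_feat)            # C = 2 channels
concatenated = fuse_concat(rgb_feat, depth_feat)  # 2C = 4 channels
```

With summation the decoder sees the same number of channels as in the one-stream case, whereas with concatenation the layer that follows the junction has to accept twice as many input channels.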
4 Experiments
In this section, we present the datasets used, experimental setup and results.
4.1 Datasets
We choose two datasets, Virtual KITTI [9] and Cityscapes [5], since they contain outdoor road scenes, consistent with our focus on automated driving. Additionally, Virtual KITTI provides perfect ground-truth depth annotation, which lets us evaluate our algorithm without depth estimation errors. Typically, depth is calculated in automotive applications using stereo images or structure from motion. We use the Cityscapes depth maps, which are computed with the SGM algorithm from stereo images. Virtual KITTI is a synthetic dataset that consists of 21,260 frames of road scenes in urban environments under different weather conditions. We exploit both its depth and semantic segmentation annotations. Cityscapes is a well-known dataset containing real images of road scenes. It consists of 20,000 images with coarse semantic segmentation annotation and 5,000 with fine annotation. We use only the fine annotations in our experiments, and we intentionally use noisy SGM depth to understand the effect of relatively noisy depth estimates. We report the IoU metric on the validation set, which contains 500 frames.
4.2 Experimental Setup
We use the Virtual KITTI and Cityscapes datasets, where the image dimensions are \(375 \times 1242\) and \(1024 \times 2048\) (down-scaled to \(512 \times 1024\) during training) respectively. For all experiments, we initialize the encoder with the weights of a VGG model pre-trained on ImageNet. This transfer learning gives a better initialization of the encoder at the start of the joint encoder-decoder training for semantic segmentation. Dropout with probability 0.5 is used in our model, in particular for the \(1 \times 1\) convolutional layers. The Adam optimizer is used with an initial learning rate of \(10^{-5}\), together with L2 regularization in the loss function with a factor of \(5 \times 10^{-4}\) to avoid over-fitting. To evaluate our proposal, the widely used Intersection over Union (IoU) is measured for both datasets; precision, recall and F-score are additionally reported for the Virtual KITTI dataset.
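For reference, the four metrics named above can be computed from flattened label maps as in the sketch below. It uses the standard definitions in terms of true positives, false positives and false negatives; how the scores are aggregated over images and classes in the reported tables is not stated in the text, so the averaging in `mean_iou` is an assumption, and both function names are hypothetical.

```python
from typing import Dict, List


def class_metrics(pred: List[int], truth: List[int], cls: int) -> Dict[str, float]:
    """IoU, precision, recall and F-score for one class over flattened labels."""
    pairs = list(zip(pred, truth))
    tp = sum(1 for p, t in pairs if p == cls and t == cls)
    fp = sum(1 for p, t in pairs if p == cls and t != cls)
    fn = sum(1 for p, t in pairs if p != cls and t == cls)

    def ratio(num: float, den: float) -> float:
        # Return 0.0 when the denominator is empty instead of dividing by zero.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "iou": ratio(tp, tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "f_score": ratio(2 * precision * recall, precision + recall),
    }


def mean_iou(pred: List[int], truth: List[int], num_classes: int) -> float:
    """Average IoU over classes present in the prediction or the ground truth."""
    present = [c for c in range(num_classes) if c in pred or c in truth]
    if not present:
        return 0.0
    return sum(class_metrics(pred, truth, c)["iou"] for c in present) / len(present)


# Toy example: four pixels, two classes.
pred = [0, 0, 1, 1]
truth = [0, 1, 1, 1]
scores = class_metrics(pred, truth, 1)
miou = mean_iou(pred, truth, 2)
```

In this toy case class 1 has two true positives and one false negative, so its IoU equals its recall (2/3) while its precision is 1.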
4.3 Experimental Results
We provide quantitative evaluation on both datasets separately using the IoU metric, as shown in Tables 1, 2 and 3. Video links to the results of the four architectures are also provided for both datasets. In addition to the depth annotations provided with the datasets, we generated depth maps for both datasets using the unsupervised approach [10] and compared the results against those obtained with ground-truth depth on Virtual KITTI and noisy SGM depth on Cityscapes.
Table 1 shows that depth augmentation consistently improves results in all four reported metrics: an improvement of 5.7% in IoU, 3.8% in precision, 4.36% in recall and 5.9% in F-score. Class-wise evaluation is listed in Table 2. Although the overall improvement is incremental, there is a large improvement for certain classes; for example, trucks, van, building and traffic lights show an improvement of 32%, 28%, 9% and 8% respectively. Cityscapes results are reported in Table 3 and show a more moderate improvement of 1% in IoU. Nevertheless, the results show that even noisy depth maps containing invalid values due to depth estimation errors can improve semantic segmentation.
In summary, the network that concatenates depth with RGB feature maps shows better results than the others on Virtual KITTI. On Cityscapes, feature map concatenation and addition perform fairly closely; however, with the monoDepth [10] estimator, concatenated feature maps outperformed added feature maps. Qualitative results of all four networks are shown for Virtual KITTI (Figs. 3 and 4) and Cityscapes (Figs. 5 and 6).
Test results on both datasets are shared publicly on YouTube (Footnotes 1 and 2). The Depth-only network is reported to study what the depth cue alone can contribute to semantic segmentation. Surprisingly, depth alone provides good results, especially for road, vegetation, vehicle, and pedestrians. This is consistent with the results obtained by [12] when depth only is tested on indoor scenes. We noticed a degradation of accuracy relative to the RGB baseline whenever the depth is noisy. Hence, a next step would be a more systematic evaluation of depth that is loosely coupled within the network. It is observed that the joint network outperforms the depth-only network by a negligible margin; perhaps the network does not really learn how to combine these two completely different cues, so the two modalities are not logically fused in the network. Our future plans include constructing multi-modal architectures to achieve better amalgamation of heterogeneous cues.
5 Conclusion
In this paper, we focused on the impact of a relatively unexplored cue, depth, on semantic segmentation. We designed four segmentation networks that take as input RGB only, depth only, concatenated RGBD, and two-stream RGB and depth. Our experimental results for the four models on two automotive datasets, Virtual KITTI and Cityscapes, demonstrate a reasonable improvement in overall accuracy and a good improvement for a few specific classes for the networks that use simple depth augmentation. We believe the present study furnishes adequate evidence of the impact of depth on accurate semantic segmentation. In future work, we plan to build a better depth-aware, more robust model that fully utilizes the complementary nature of depth.
References
1. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561 (2015)
2. Brostow, G.J., Shotton, J., Fauqueur, J., Cipolla, R.: Segmentation and recognition using structure from motion point clouds. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5302, pp. 44–57. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88682-2_5
3. Cao, Y., Shen, C., Shen, H.T.: Exploiting depth from single monocular images for object detection and semantic segmentation. IEEE Trans. Image Process. 26(2), 836–846 (2017)
4. Chen, L.C., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: scale-aware semantic image segmentation. arXiv preprint arXiv:1511.03339 (2015)
5. Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. arXiv preprint arXiv:1604.01685 (2016)
6. Cordts, M., et al.: The stixel world: a medium-level representation of traffic scenes. Image Vis. Comput. 68, 40–52 (2017)
7. Das, A., Yogamani, S.: Evaluation of residual learning in lightweight deep networks for object classification. In: Proceedings of the Irish Machine Vision and Image Processing Conference, pp. 205–208 (2018)
8. Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1915–1929 (2013)
9. Gaidon, A., Wang, Q., Cabon, Y., Vig, E.: Virtual worlds as proxy for multi-object tracking analysis. In: CVPR (2016)
10. Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR, vol. 2, p. 7 (2017)
11. Grangier, D., Bottou, L., Collobert, R.: Deep convolutional networks for scene parsing. In: ICML 2009 Deep Learning Workshop, vol. 3. Citeseer (2009)
12. Hazirbas, C., Ma, L., Domokos, C., Cremers, D.: FuseNet: incorporating depth into semantic segmentation via fusion-based CNN architecture. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10111, pp. 213–228. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54181-5_14
13. Hirschmuller, H.: Accurate and efficient stereo processing by semi-global matching and mutual information. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition CVPR 2005, vol. 2, pp. 807–814. IEEE (2005)
14. Horgan, J., Hughes, C., McDonald, J., Yogamani, S.: Vision-based driver assistance systems: survey, taxonomy and advances. In: 2015 IEEE 18th International Conference on Intelligent Transportation Systems (ITSC), pp. 2032–2039. IEEE (2015)
15. Jain, S.D., Xiong, B., Grauman, K.: Fusionseg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. arXiv preprint arXiv:1701.05384 (2017)
16. Kundu, A., Li, Y., Dellaert, F., Li, F., Rehg, J.M.: Joint semantic segmentation and 3D reconstruction from monocular video. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 703–718. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_45
17. Lin, D., Chen, G., Cohen-Or, D., Heng, P.A., Huang, H.: Cascaded feature network for semantic segmentation of RGB-D images. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1320–1328. IEEE (2017)
18. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
19. Ma, L., Stückler, J., Kerl, C., Cremers, D.: Multi-view deep learning for consistent semantic mapping with RGB-D cameras. arXiv preprint arXiv:1703.08866 (2017)
20. Mayer, N., et al.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048 (2016)
21. McCormac, J., Handa, A., Davison, A., Leutenegger, S.: Semanticfusion: dense 3D semantic mapping with convolutional neural networks. arXiv preprint arXiv:1609.05130 (2016)
22. Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1520–1528 (2015)
23. Qi, G.J.: Hierarchically gated deep networks for semantic segmentation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016
24. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
25. Siam, M., Elkerdawy, S., Jagersand, M., Yogamani, S.: Deep semantic segmentation for automated driving: taxonomy, roadmap and challenges. arXiv preprint arXiv:1707.02432 (2017)
26. Siam, M., Mahgoub, H., Zahran, M., Yogamani, S., Jagersand, M., El-Sallab, A.: MODNET: moving object detection network with motion and appearance for autonomous driving. arXiv preprint arXiv:1709.04821 (2017)
27. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
28. Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.: Sparsity invariant CNNs. arXiv preprint arXiv:1708.06500 (2017)
29. Wang, W., Neumann, U.: Depth-aware CNN for RGB-D segmentation. arXiv preprint arXiv:1803.06791 (2018)
30. Whelan, T., Leutenegger, S., Salas-Moreno, R.F., Glocker, B., Davison, A.J.: Elasticfusion: dense SLAM without a pose graph. In: Robotics: Science and Systems, vol. 11 (2015)
31. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
32. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858 (2017)
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Rashed, H., Yogamani, S., El-Sallab, A., Das, A., El-Helw, M. (2019). Depth Augmented Semantic Segmentation Networks for Automated Driving. In: Arora, C., Mitra, K. (eds) Computer Vision Applications. WCVA 2018. Communications in Computer and Information Science, vol 1019. Springer, Singapore. https://doi.org/10.1007/978-981-15-1387-9_1
DOI: https://doi.org/10.1007/978-981-15-1387-9_1
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-1386-2
Online ISBN: 978-981-15-1387-9