Abstract
In this paper, we explore augmenting semantic segmentation with depth maps, motivated by the geometric structure of automotive scenes. Depth is typically already computed in an automotive system for object localization and path planning, and can thus be leveraged for semantic segmentation. We construct two baseline networks, “RGB only” and “Depth only”, and investigate the impact of fusing both cues with two further networks, “RGBD concat” and “Two Stream RGB+D”. We evaluate these networks on two automotive datasets: Virtual KITTI, using synthetic depth, and Cityscapes, using a standard stereo depth estimation algorithm. Additionally, we evaluate our approach using the monoDepth unsupervised estimator [10]. The two-stream architecture achieves the best results, with an improvement of 5.7% IoU on Virtual KITTI and 1% IoU on Cityscapes. Certain classes improve substantially: truck, building, van and car gain 29%, 11%, 9% and 8% respectively on Virtual KITTI. Surprisingly, the CNN model is able to produce good semantic segmentation from depth images alone. The proposed network runs at 4 fps on a TitanX GPU (Maxwell architecture).
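The two fusion strategies named in the abstract differ only in where the depth cue enters the network: “RGBD concat” appends depth as a fourth input channel, while “Two Stream RGB+D” runs separate encoders and concatenates their feature maps before the decoder. The sketch below illustrates both tensor layouts, plus the per-class IoU metric used for evaluation, with numpy on dummy data; `toy_encoder` is a hypothetical stand-in for the actual CNN encoders, which the abstract does not specify.

```python
import numpy as np

# Early fusion ("RGBD concat"): stack the depth map as a 4th input channel.
rgb = np.random.rand(1, 3, 64, 64).astype(np.float32)    # NCHW RGB image
depth = np.random.rand(1, 1, 64, 64).astype(np.float32)  # aligned depth map
rgbd = np.concatenate([rgb, depth], axis=1)              # shape (1, 4, 64, 64)

def toy_encoder(x, out_channels=8):
    """Hypothetical stand-in for a CNN encoder: a fixed random 1x1
    channel projection, just to produce feature maps of the right shape."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((out_channels, x.shape[1])).astype(np.float32)
    return np.einsum("oc,nchw->nohw", w, x)

# Mid-level fusion ("Two Stream RGB+D"): encode each modality separately,
# then concatenate the feature maps channel-wise before decoding.
feat_rgb = toy_encoder(rgb)
feat_depth = toy_encoder(depth)
fused = np.concatenate([feat_rgb, feat_depth], axis=1)   # (1, 16, 64, 64)

def iou(pred, gt, cls):
    """Intersection-over-union for one class, as reported in the results."""
    inter = np.logical_and(pred == cls, gt == cls).sum()
    union = np.logical_or(pred == cls, gt == cls).sum()
    return inter / union if union else float("nan")
```

Class IoU scores are averaged over all classes to give the mean IoU figures quoted above.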
References
Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561 (2015)
Brostow, G.J., Shotton, J., Fauqueur, J., Cipolla, R.: Segmentation and recognition using structure from motion point clouds. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5302, pp. 44–57. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88682-2_5
Cao, Y., Shen, C., Shen, H.T.: Exploiting depth from single monocular images for object detection and semantic segmentation. IEEE Trans. Image Process. 26(2), 836–846 (2017)
Chen, L.C., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: scale-aware semantic image segmentation. arXiv preprint arXiv:1511.03339 (2015)
Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. arXiv preprint arXiv:1604.01685 (2016)
Cordts, M., et al.: The stixel world: a medium-level representation of traffic scenes. Image Vis. Comput. 68, 40–52 (2017)
Das, A., Yogamani, S.: Evaluation of residual learning in lightweight deep networks for object classification. In: Proceedings of the Irish Machine Vision and Image Processing Conference, pp. 205–208 (2018)
Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1915–1929 (2013)
Gaidon, A., Wang, Q., Cabon, Y., Vig, E.: Virtual worlds as proxy for multi-object tracking analysis. In: CVPR (2016)
Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: CVPR, vol. 2, p. 7 (2017)
Grangier, D., Bottou, L., Collobert, R.: Deep convolutional networks for scene parsing. In: ICML 2009 Deep Learning Workshop, vol. 3. Citeseer (2009)
Hazirbas, C., Ma, L., Domokos, C., Cremers, D.: FuseNet: incorporating depth into semantic segmentation via fusion-based CNN architecture. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds.) ACCV 2016. LNCS, vol. 10111, pp. 213–228. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54181-5_14
Hirschmuller, H.: Accurate and efficient stereo processing by semi-global matching and mutual information. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 807–814. IEEE (2005)
Horgan, J., Hughes, C., McDonald, J., Yogamani, S.: Vision-based driver assistance systems: survey, taxonomy and advances. In: 2015 IEEE 18th International Conference on Intelligent Transportation Systems (ITSC), pp. 2032–2039. IEEE (2015)
Jain, S.D., Xiong, B., Grauman, K.: FusionSeg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. arXiv preprint arXiv:1701.05384 (2017)
Kundu, A., Li, Y., Dellaert, F., Li, F., Rehg, J.M.: Joint semantic segmentation and 3D reconstruction from monocular video. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 703–718. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_45
Lin, D., Chen, G., Cohen-Or, D., Heng, P.A., Huang, H.: Cascaded feature network for semantic segmentation of RGB-D images. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1320–1328. IEEE (2017)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
Ma, L., Stückler, J., Kerl, C., Cremers, D.: Multi-view deep learning for consistent semantic mapping with RGB-D cameras. arXiv preprint arXiv:1703.08866 (2017)
Mayer, N., et al.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048 (2016)
McCormac, J., Handa, A., Davison, A., Leutenegger, S.: SemanticFusion: dense 3D semantic mapping with convolutional neural networks. arXiv preprint arXiv:1609.05130 (2016)
Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1520–1528 (2015)
Qi, G.J.: Hierarchically gated deep networks for semantic segmentation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Siam, M., Elkerdawy, S., Jagersand, M., Yogamani, S.: Deep semantic segmentation for automated driving: taxonomy, roadmap and challenges. arXiv preprint arXiv:1707.02432 (2017)
Siam, M., Mahgoub, H., Zahran, M., Yogamani, S., Jagersand, M., El-Sallab, A.: MODNET: moving object detection network with motion and appearance for autonomous driving. arXiv preprint arXiv:1709.04821 (2017)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.: Sparsity invariant CNNs. arXiv preprint arXiv:1708.06500 (2017)
Wang, W., Neumann, U.: Depth-aware CNN for RGB-D segmentation. arXiv preprint arXiv:1803.06791 (2018)
Whelan, T., Leutenegger, S., Salas-Moreno, R.F., Glocker, B., Davison, A.J.: ElasticFusion: dense SLAM without a pose graph. In: Robotics: Science and Systems, vol. 11 (2015)
Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858 (2017)
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Rashed, H., Yogamani, S., El-Sallab, A., Das, A., El-Helw, M. (2019). Depth Augmented Semantic Segmentation Networks for Automated Driving. In: Arora, C., Mitra, K. (eds) Computer Vision Applications. WCVA 2018. Communications in Computer and Information Science, vol 1019. Springer, Singapore. https://doi.org/10.1007/978-981-15-1387-9_1
DOI: https://doi.org/10.1007/978-981-15-1387-9_1
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-1386-2
Online ISBN: 978-981-15-1387-9
eBook Packages: Computer Science (R0)