Large-scale 3D Semantic Mapping Using Stereo Vision

  • Yi Yang
  • Fan Qiu
  • Hao Li
  • Lu Zhang
  • Mei-Ling Wang
  • Meng-Yin Fu
Research Article

Abstract

In recent years, there has been considerable interest in incorporating semantics into simultaneous localization and mapping (SLAM) systems. This paper presents an approach for generating a large-scale, dense 3D semantic map of outdoor environments based on binocular stereo vision. The inputs to the system are stereo color images from a moving vehicle. First, the dense 3D space around the vehicle is reconstructed, and the motion of the camera is estimated by visual odometry. Meanwhile, semantic segmentation is performed online using deep learning, and the semantic labels are also used to verify the feature matching in visual odometry. These three processes calculate the motion, depth and semantic label of every pixel in the input views. Then, a voxel conditional random field (CRF) inference is introduced to fuse the semantic labels into voxels. After that, we present a method to remove moving objects by incorporating the semantic labels, which improves the motion segmentation accuracy. Finally, a dense 3D semantic map of the urban environment is generated from an arbitrarily long image sequence. We evaluate our approach on the KITTI vision benchmark, and the results show that the proposed method is effective.
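
As a rough illustration of how per-pixel depth and semantic labels can end up in a voxel map, the sketch below back-projects pixels with the standard stereo relation Z = f·B/d and then assigns each voxel its most frequent label. This is only a minimal sketch: the majority vote is a simple stand-in for the voxel CRF inference described above, not the authors' method, and the function names, camera parameters and voxel size are all hypothetical. Points are left in the camera frame; a full pipeline would first transform them with the pose from visual odometry.

```python
from collections import Counter, defaultdict


def backproject(u, v, disparity, f, cx, cy, baseline):
    """Back-project pixel (u, v) to camera coordinates with a pinhole stereo model.

    f is the focal length in pixels, (cx, cy) the principal point,
    baseline the stereo baseline in metres (all assumed, illustrative values).
    """
    if disparity <= 0:
        return None  # no valid stereo match for this pixel
    z = f * baseline / disparity  # depth from disparity
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return (x, y, z)


def voxel_key(point, voxel_size):
    """Integer grid index of the voxel containing a 3D point."""
    return tuple(int(c // voxel_size) for c in point)


def fuse_labels(observations, f, cx, cy, baseline, voxel_size):
    """Fuse per-pixel labels into voxels by majority vote.

    observations: iterable of (u, v, disparity, label) tuples.
    Majority voting is a simplification used here in place of CRF inference.
    """
    votes = defaultdict(Counter)
    for u, v, d, label in observations:
        p = backproject(u, v, d, f, cx, cy, baseline)
        if p is None:
            continue  # skip pixels without depth
        votes[voxel_key(p, voxel_size)][label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


if __name__ == "__main__":
    obs = [
        (600, 180, 35.0, "road"),
        (601, 180, 35.0, "road"),
        (600, 181, 35.0, "car"),
        (10, 10, 0.0, "sky"),  # invalid disparity, ignored
    ]
    print(fuse_labels(obs, f=700.0, cx=600.0, cy=180.0, baseline=0.5, voxel_size=0.5))
```

A CRF would additionally encourage neighbouring voxels to agree on their labels, which a per-voxel vote cannot capture.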

Keywords

Semantic map, stereo vision, motion segmentation, visual odometry, simultaneous localization and mapping (SLAM)

Copyright information

© Institute of Automation, Chinese Academy of Sciences and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. School of Automation and National Key Laboratory of Intelligent Control and Decision of Complex Systems, Beijing Institute of Technology, Beijing, China
  2. Nanjing University of Science and Technology, Nanjing, China
