Large-scale 3D Semantic Mapping Using Stereo Vision

Abstract

In recent years, there has been considerable interest in incorporating semantics into simultaneous localization and mapping (SLAM) systems. This paper presents an approach for generating a large-scale, dense 3D semantic map of outdoor environments based on binocular stereo vision. The inputs to the system are stereo color images captured from a moving vehicle. First, the dense 3D space around the vehicle is reconstructed, and the camera motion is estimated by visual odometry. Meanwhile, semantic segmentation is performed online using deep learning, and the semantic labels are also used to verify the feature matches in visual odometry. Together, these three processes compute the motion, depth and semantic label of every pixel in the input views. Then, a voxel conditional random field (CRF) inference is introduced to fuse the semantic labels into voxels. After that, we present a method to remove moving objects by incorporating the semantic labels, which improves the motion segmentation accuracy. Finally, a dense 3D semantic map of an urban environment is generated from an arbitrarily long image sequence. We evaluate our approach on the KITTI vision benchmark, and the results show that the proposed method is effective.
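The data flow summarized in the abstract (per-pixel depth, camera pose and semantic label fused into labeled voxels, with moving objects excluded) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tuple layout of each pixel observation, and the per-pixel `moving` flag are hypothetical, and a simple majority vote per voxel stands in for the voxel CRF inference described in the paper. Depth is recovered from disparity with the standard rectified-stereo relation z = f·b/d.

```python
import math
from collections import Counter, defaultdict


def backproject(u, v, disparity, f, cx, cy, baseline):
    """Pixel (u, v) with stereo disparity -> 3D point in the camera frame."""
    z = f * baseline / disparity          # rectified stereo: depth from disparity
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return (x, y, z)


def to_world(point, pose):
    """Apply a camera-to-world pose (R, t); R is a 3x3 nested list, t a 3-tuple."""
    R, t = pose
    return tuple(
        sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)
    )


def voxel_key(point, voxel_size):
    """Integer grid index of the voxel that contains the point."""
    return tuple(math.floor(c / voxel_size) for c in point)


def fuse_frames(frames, f, cx, cy, baseline, voxel_size=0.5):
    """Fuse per-pixel labels from many frames into one label per voxel.

    frames: iterable of (pose, pixels); each pixel is a hypothetical tuple
    (u, v, disparity, label, moving). Majority voting is an assumption made
    for this sketch; the paper uses voxel CRF inference instead.
    """
    votes = defaultdict(Counter)
    for pose, pixels in frames:
        for u, v, disparity, label, moving in pixels:
            if moving or disparity <= 0:
                continue                  # drop moving objects and invalid depth
            p_cam = backproject(u, v, disparity, f, cx, cy, baseline)
            p_world = to_world(p_cam, pose)
            votes[voxel_key(p_world, voxel_size)][label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
```

Because the result is keyed by voxel index rather than stored in a fixed-size grid, only observed voxels are kept, which is one simple way to handle arbitrarily long sequences; it is a design choice of this sketch, not a claim about the paper's data structure.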

Author information

Corresponding author

Correspondence to Yi Yang.

Additional information

This work was supported by the National Natural Science Foundation of China (Nos. NSFC 61473042 and 61105092) and the Beijing Higher Education Young Elite Teacher Project (No. YETP1215).

Recommended by Associate Editor Hong Qiao

Yi Yang received the Ph.D. degree in automation from Beijing Institute of Technology, China in 2010. He is currently an associate professor with the School of Automation, Beijing Institute of Technology, China.

His research interests include autonomous vehicles, bioinspired robots, intelligent navigation, semantic mapping and scene understanding.

Fan Qiu received the B.Eng. degree in automation from the Beijing Institute of Technology, China in 2014, where he is currently a master's student in control science and engineering.

His research interests include deep learning, semantic mapping and computer vision.

Hao Li received the B.Eng. degree in automation from the Beijing Institute of Technology, China in 2015, where he is currently a master's student in control science and engineering.

His research interests include machine learning, semantic mapping and scene understanding.

Lu Zhang received the B.Eng. degree in automation from the Beijing Institute of Technology, China in 2015, where he is a master's student in control science and engineering.

His research interests include SLAM, path planning and computer vision.

Mei-Ling Wang received the M.Eng. and Ph.D. degrees from the School of Automation, Beijing Institute of Technology, China in 1995 and 2007, respectively. She is currently a professor with the School of Automation, Beijing Institute of Technology, and a Changjiang Scholar of the Ministry of Education of China.

Her research interests include geographic information systems, intelligent navigation and unmanned ground vehicles.

Meng-Yin Fu received the M.Eng. degree from the School of Automation, Beijing Institute of Technology, China in 1992, and the Ph.D. degree from the Chinese Academy of Sciences, China in 2000. He was a professor with the School of Automation, Beijing Institute of Technology, China, from 2000 to 2013. He is currently a professor with the Nanjing University of Science and Technology, China, and a Changjiang Scholar of the Ministry of Education of China.

His research interests include integrated navigation systems, intelligent navigation and unmanned ground vehicles.

About this article

Cite this article

Yang, Y., Qiu, F., Li, H. et al. Large-scale 3D Semantic Mapping Using Stereo Vision. Int. J. Autom. Comput. 15, 194–206 (2018). https://doi.org/10.1007/s11633-018-1118-y

Keywords

  • Semantic map
  • Stereo vision
  • Motion segmentation
  • Visual odometry
  • Simultaneous localization and mapping (SLAM)