On-Line Large Scale Semantic Fusion

  • Tommaso Cavallari
  • Luigi Di Stefano
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9915)


Recent research on 3D reconstruction has delivered reliable and fast pipelines to obtain accurate volumetric maps of large environments. Alongside, we have witnessed dramatic improvements in the field of semantic segmentation of images due to the deployment of deep learning architectures. In this paper, we pursue bridging the semantic gap of purely geometric representations by leveraging a SLAM pipeline and a deep neural network so as to endow surface patches with category labels. In particular, we present the first system that, based on the input stream provided by a commodity RGB-D sensor, can interactively and automatically deliver a map of a large-scale environment featuring both geometric and semantic information. We also show how the significant computational cost inherent to the deployment of a state-of-the-art deep network for semantic labeling does not hinder interactivity, thanks to suitable scheduling of the workload on an off-the-shelf PC platform equipped with two GPUs.
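The dual-GPU scheduling idea described above can be sketched as a producer-consumer pattern: the reconstruction loop integrates every incoming frame without ever waiting for the semantic network, while a separate worker labels the most recent available frame and fuses the result back into the volume. This is a minimal illustrative sketch, not the authors' implementation; all function names (`integrate_frame`, `label_frame`, `fuse_labels`) and the single-slot queue policy are assumptions standing in for the real TSDF fusion and CNN inference stages.

```python
import queue
import threading

# Hypothetical stand-ins for the two pipeline stages: in the paper's
# system, volumetric fusion runs on one GPU and the semantic CNN on a
# second one. Here both are simulated so the sketch is runnable.
def integrate_frame(volume, frame):
    volume.append(frame)                  # fuse depth data into the map

def label_frame(frame):
    return (frame, "per-pixel labels")    # CNN forward pass (slow path)

def fuse_labels(volume, labeled):
    pass                                  # project labels into voxels

def run(frames):
    pending = queue.Queue(maxsize=1)      # hold only the newest frame
    volume = []
    stop = object()                       # sentinel to end the worker

    def labeler():
        while True:
            item = pending.get()
            if item is stop:
                break
            fuse_labels(volume, label_frame(item))

    worker = threading.Thread(target=labeler)
    worker.start()
    for frame in frames:
        integrate_frame(volume, frame)    # tracking/fusion never blocks
        try:
            pending.put_nowait(frame)     # offer the frame for labeling
        except queue.Full:
            pass                          # labeler busy: skip this frame
    pending.put(stop)
    worker.join()
    return volume
```

Because `put_nowait` drops frames when the labeler is busy, the slow CNN only ever lags behind the reconstruction rather than stalling it, which is the essence of keeping the system interactive.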


SLAM · Deep learning · Semantic segmentation · Large scale reconstruction · Semantic fusion



We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

Supplementary material

Supplementary material 1 (mp4 54458 KB)



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
