The Visual Computer, Volume 34, Issue 5, pp 735–747

Multi-class indoor semantic segmentation with deep structured model

  • Chuanxia Zheng
  • Jianhua Wang
  • Weihai Chen
  • Xingming Wu
Original Article


Abstract

Indoor semantic segmentation plays a critical role in many applications, such as intelligent robotics. However, multi-class recognition remains challenging, especially for pixel-level indoor semantic labeling. In this paper, a novel deep structured model is proposed that combines the strengths of the widely used convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We first present a multi-information fusion model that utilizes scene category information to fine-tune the fully convolutional network. Then, to refine the coarse outputs of the CNN, an RNN is applied to the final CNN layer, yielding an end-to-end trainable system. This Graph-RNN is derived from a conditional random field defined over a superpixel segmentation graph, which can exploit flexible contextual information from different neighboring regions. Experimental results on the recent large SUN RGB-D dataset demonstrate that the proposed model outperforms existing state-of-the-art methods on the challenging 40-dominant-class task (40.8% mean IU accuracy and 69.1% pixel accuracy). We also evaluate our model on the public NYU Depth V2 dataset and achieve remarkable performance.
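The CRF-derived refinement described in the abstract can be illustrated with a minimal sketch. All names here (`graph_rnn_refine`, `alpha`, the toy adjacency) are illustrative assumptions, not the authors' implementation: the actual model learns its recurrent weights end-to-end with the CNN, whereas this sketch uses a fixed mean-field-style update that recurrently mixes each superpixel's coarse CNN scores with its graph neighbors' beliefs.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def graph_rnn_refine(unary, adjacency, num_steps=5, alpha=2.0):
    """Refine coarse per-superpixel class scores by recurrently
    mixing in neighboring superpixels' beliefs (CRF-inspired sketch).

    unary:     (n_superpixels, n_classes) coarse CNN scores (logits)
    adjacency: (n, n) row-normalized superpixel adjacency matrix
    alpha:     strength of the neighbor-smoothing term (illustrative)
    """
    q = softmax(unary)
    for _ in range(num_steps):
        msg = adjacency @ q               # aggregate neighbor beliefs
        q = softmax(unary + alpha * msg)  # recurrent update toward agreement
    return q

# Toy example: a 3-superpixel chain where the middle node is ambiguous.
unary = np.array([[2.0, 0.0],   # node 0: confidently class 0
                  [0.1, 0.0],   # node 1: nearly undecided
                  [2.0, 0.0]])  # node 2: confidently class 0
adjacency = np.array([[0.0, 1.0, 0.0],
                      [0.5, 0.0, 0.5],
                      [0.0, 1.0, 0.0]])
refined = graph_rnn_refine(unary, adjacency)
```

After refinement, the ambiguous middle superpixel is pulled toward its confident neighbors' label, mimicking how the graph-structured recurrence propagates contextual information across neighboring regions.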


Keywords: Semantic segmentation · Scene classification · Convolutional neural network · Graph-RNN · Conditional random field



Acknowledgements

The work described in this paper was supported by the National Natural Science Foundation of China under Research Project Grant Nos. 61573048 and 61620106012, the International Scientific and Technological Cooperation Project of China under Grant No. 2015DFG12650, and the Key Laboratory of Robotics and Intelligent Manufacturing Equipment Technology of Zhejiang Province.



Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  • Chuanxia Zheng (1)
  • Jianhua Wang (1)
  • Weihai Chen (1)
  • Xingming Wu (1)

  1. School of Automation Science and Electrical Engineering, Beihang University, Beijing, People’s Republic of China
