Multi-class indoor semantic segmentation with deep structured model

Abstract

Indoor semantic segmentation plays a critical role in many applications, such as intelligent robotics. However, multi-class recognition remains challenging, especially for pixel-level indoor semantic labeling. In this paper, we propose a novel deep structured model that combines the strengths of the widely used convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We first present a multi-information fusion model that exploits scene category information to fine-tune a fully convolutional network. Then, to refine the coarse CNN outputs, an RNN is applied to the final CNN layer, yielding an end-to-end trainable system. This Graph-RNN is derived from a conditional random field defined over a superpixel segmentation graph, which allows flexible contextual information from different neighboring regions to be exploited. Experimental results on the recent large-scale SUN RGB-D dataset demonstrate that the proposed model outperforms existing state-of-the-art methods on the challenging 40-dominant-class task (\(40.8\%\) mean IU accuracy and \(69.1\%\) pixel accuracy). We also evaluate our model on the public NYU Depth V2 dataset and achieve remarkable performance.
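The abstract's core idea, refining coarse per-superpixel CNN scores by recurrent message passing over the superpixel adjacency graph, can be sketched as follows. This is only an illustrative toy in NumPy: the shapes, the averaging update rule, the mixing weight `alpha`, and the function `graph_rnn_refine` are all assumptions for exposition, not the authors' exact CRF-derived formulation.

```python
# Illustrative sketch (not the paper's exact model): CNN unaries over
# superpixels are refined by recurrently averaging neighbor states along
# the superpixel adjacency graph, then normalized to class probabilities.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_rnn_refine(unaries, adjacency, steps=3, alpha=0.5):
    """Refine per-superpixel class scores with neighborhood context.

    unaries   : (N, C) coarse CNN scores for N superpixels, C classes
    adjacency : (N, N) binary symmetric superpixel adjacency matrix
    alpha     : weight given to the aggregated neighbor message
    """
    deg = adjacency.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # isolated nodes keep their unaries
    h = unaries.copy()
    for _ in range(steps):
        neighbor_mean = adjacency @ h / deg  # aggregate contextual information
        h = (1 - alpha) * unaries + alpha * neighbor_mean
    return softmax(h)

# Toy example: 4 superpixels in a chain, 3 classes.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
unaries = np.array([[2.0, 0.1, 0.1],
                    [1.5, 0.2, 0.1],
                    [0.1, 0.2, 1.8],
                    [0.1, 0.1, 2.0]], dtype=float)
probs = graph_rnn_refine(unaries, adj)
labels = probs.argmax(axis=1)
```

In the paper this recurrence is instead derived from mean-field-style CRF inference, so the update involves learned pairwise terms rather than a plain neighbor average, but the data flow, iterated propagation of evidence along superpixel edges on top of fixed CNN unaries, is the same.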

Figs. 1–7 (figures available in the full text)

Notes

  1. http://research.cs.washington.edu/istc/lfb/ [34].

  2. http://cs.nyu.edu/~silberman/code.html [33].

  3. http://www.cs.berkeley.edu/~sgupta/ [16].

  4. https://github.com/BVLC/caffe/wiki/Model-Zoo#fcn [32].

References

  1. Anand, A., Koppula, H.S., Joachims, T., Saxena, A.: Contextually guided semantic labeling and search for three-dimensional point clouds. Int. J. Robot. Res. 32(1), 19–34 (2013)

  2. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 33(5), 898–916 (2011)

  3. Bell, S., Upchurch, P., Snavely, N., Bala, K.: Material recognition in the wild with the materials in context database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3479–3487 (2015)

  4. Bo, L., Ren, X., Fox, D.: Unsupervised feature learning for RGB-D based object recognition. In: Experimental Robotics, pp. 387–402. Springer (2013)

  5. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: International Conference on Learning Representations (ICLR), San Diego (2015)

  6. Chen, W., Yue, H., Wang, J., Wu, X.: An improved edge detection algorithm for depth map inpainting. Opt. Lasers Eng. 55, 69–77 (2014)

  7. Cheng, M.M., Zheng, S., Lin, W.Y., Vineet, V., Sturgess, P., Crook, N., Mitra, N.J., Torr, P.: ImageSpirit: verbal guided image parsing. ACM Trans. Graph. 34(1), 3:1–3:11 (2014). doi:10.1145/2682628

  8. Couprie, C., Farabet, C., Najman, L., LeCun, Y.: Indoor semantic segmentation using depth information. arXiv preprint arXiv:1301.3572 (2013)

  9. Deng, Z., Todorovic, S., Jan Latecki, L.: Semantic segmentation of RGBD images with mutex constraints. In: IEEE International Conference on Computer Vision (ICCV), pp. 1733–1741 (2015)

  10. Ding, K., Chen, W., Wu, X.: Optimum inpainting for depth map based on \(L_0\) total variation. Vis. Comput. 30(12), 1311–1320 (2014)

  11. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: IEEE International Conference on Computer Vision (ICCV), pp. 2650–2658 (2015)

  12. Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1915–1929 (2013)

  13. Girshick, R.: Fast R-CNN. In: IEEE International Conference on Computer Vision (ICCV), Santiago, Chile (2015)

  14. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587 (2014)

  15. Gupta, S., Arbelaez, P., Malik, J.: Perceptual organization and recognition of indoor scenes from RGB-D images. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 564–571 (2013)

  16. Gupta, S., Girshick, R., Arbelaez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Computer Vision–ECCV 2014, pp. 345–360. Springer (2014)

  17. Hariharan, B., Arbelaez, P., Girshick, R., Malik, J.: Simultaneous detection and segmentation. In: European Conference on Computer Vision (ECCV), pp. 297–312 (2014)

  18. Hayat, M., Khan, S.H., Bennamoun, M.: A spatial layout and scale invariant feature representation for indoor scene classification. IEEE Trans. Image Process. 25(10), 4829–4841 (2016)

  19. Hermans, A., Floros, G., Leibe, B.: Dense 3D semantic mapping of indoor scenes from RGB-D images. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 2631–2638 (2014)

  20. Hoiem, D., Efros, A.A., Hebert, M.: Putting objects in perspective. Int. J. Comput. Vis. 80(1), 3–15 (2008)

  21. Huang, G., Liu, Z., Weinberger, K.Q.: Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016)

  22. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM (2014)

  23. Khan, S.H., Bennamoun, M., Sohel, F., Togneri, R., Naseem, I.: Integrating geometrical context for semantic labeling of indoor scenes using RGB-D images. Int. J. Comput. Vis. 117(1), 1–20 (2016)

  24. Krähenbühl, P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Advances in Neural Information Processing Systems (NIPS), pp. 109–117 (2011)

  25. Koppula, H.S., Anand, A., Joachims, T., Saxena, A.: Semantic labeling of 3D point clouds for indoor scenes. In: Advances in Neural Information Processing Systems (NIPS), pp. 244–252 (2011)

  26. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105 (2012)

  27. Lai, K., Bo, L., Fox, D.: Unsupervised feature learning for 3D scene labeling. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 3050–3057 (2014)

  28. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)

  29. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)

  30. Li, Z., Gan, Y., Liang, X., Yu, Y., Cheng, H., Lin, L.: LSTM-CF: unifying context modeling and fusion with LSTMs for RGB-D scene labeling. In: European Conference on Computer Vision (ECCV), pp. 541–557. Springer (2016)

  31. Liang, X., Shen, X., Xiang, D., Feng, J., Lin, L., Yan, S.: Semantic object parsing with local-global long short-term memory. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  32. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1337–1342, Boston, MA, USA (2015)

  33. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: European Conference on Computer Vision (ECCV) (2012)

  34. Ren, X., Bo, L., Fox, D.: RGB-(D) scene labeling: features and algorithms. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2759–2766 (2012)

  35. Shotton, J., Winn, J., Rother, C., Criminisi, A.: TextonBoost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context. Int. J. Comput. Vis. 81(1), 2–23 (2009)

  36. Shuai, B., Zuo, Z., Wang, B., Wang, G.: DAG-recurrent neural networks for scene labeling. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  37. Silberman, N., Fergus, R.: Indoor scene segmentation using a structured light sensor. In: IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 601–608 (2011)

  38. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  39. Song, S., Lichtenberg, S.P., Xiao, J.: SUN RGB-D: a RGB-D scene understanding benchmark suite. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 567–576 (2015)

  40. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. arXiv preprint arXiv:1409.4842 (2014)

  41. Tighe, J., Lazebnik, S.: SuperParsing: scalable nonparametric image parsing with superpixels. In: Computer Vision–ECCV 2010, pp. 352–365. Springer (2010)

  42. van den Oord, A., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759 (2016)

  43. Wang, J., Zheng, C., Chen, W., Wu, X.: Learning aggregated features and optimizing model for semantic labeling. Vis. Comput. 1–14 (2016)

  44. Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.: Conditional random fields as recurrent neural networks. In: IEEE International Conference on Computer Vision (ICCV), pp. 1529–1537 (2015)

  45. Zhou, B., Garcia, A.L., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. Adv. Neural Inf. Process. Syst. 1, 487–495 (2014)

Acknowledgements

The work described in this paper was supported by the National Natural Science Foundation of China under Grant Nos. 61573048 and 61620106012, the International Scientific and Technological Cooperation Projects of China under Grant No. 2015DFG12650, and the Key Laboratory of Robotics and Intelligent Manufacturing Equipment Technology of Zhejiang Province.

Author information

Corresponding author

Correspondence to Weihai Chen.

About this article

Cite this article

Zheng, C., Wang, J., Chen, W. et al. Multi-class indoor semantic segmentation with deep structured model. Vis Comput 34, 735–747 (2018). https://doi.org/10.1007/s00371-017-1411-8

Keywords

  • Semantic segmentation
  • Scene classification
  • Convolutional neural network
  • Graph-RNN
  • Conditional random field