Skip to main content

FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture

  • Conference paper
  • First Online:
Computer Vision – ACCV 2016 (ACCV 2016)

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 10111))

Included in the following conference series:

Abstract

In this paper we address the problem of semantic labeling of indoor scenes on RGB-D data. With the availability of RGB-D cameras, it is expected that additional depth measurement will improve the accuracy. Here we investigate a solution how to incorporate complementary depth information into a semantic segmentation framework by making use of convolutional neural networks (CNNs). Recently encoder-decoder type fully convolutional CNN architectures have achieved a great success in the field of semantic segmentation. Motivated by this observation we propose an encoder-decoder type network, where the encoder part is composed of two branches of networks that simultaneously extract features from RGB and depth images and fuse depth features into the RGB feature maps as the network goes deeper. Comprehensive experimental evaluations demonstrate that the proposed fusion-based architecture achieves competitive results with the state-of-the-art methods on the challenging SUN RGB-D benchmark obtaining 76.27% global accuracy, 48.30% average class accuracy and 37.29% average intersection-over-union score.

C. Hazirbas and L. Ma—The authors contributed equally.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    The rectified linear unit is defined as \(\sigma (x)=\max (0,x)\).

References

  1. Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 345–360. Springer, Cham (2014). doi:10.1007/978-3-319-10584-0_23

    Google Scholar 

  2. Pinheiro, P.O., Collobert, R.: Recurrent convolutional neural networks for scene labeling. In: Proceedings of International Conference on Machine Learning, Beijing, China (2014)

    Google Scholar 

  3. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. IEEE, Boston (2015)

    Google Scholar 

  4. Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.: Conditional random fields as recurrent neural networks. In: Proceedings of IEEE International Conference on Computer Vision, pp. 1529–1537. IEEE, Santiago (2015)

    Google Scholar 

  5. Byeon, W., Breuel, T.M., Raue, F., Liwicki, M.: Scene labeling with LSTM recurrent neural networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3547–3555. IEEE, Boston (2015)

    Google Scholar 

  6. Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Proceedings of IEEE International Conference on Computer Vision (2015)

    Google Scholar 

  7. Lin, G., Shen, C., van den Hengel, A., Reid, I.: Exploring context with deep structured models for semantic segmentation. arXiv preprint arXiv:1603.03183 (2016)

  8. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: Proceedings of International Conference on Learning Representations, San Diego (2015)

    Google Scholar 

  9. Couprie, C., Farabet, C., Najman, L., LeCun, Y.: Indoor semantic segmentation using depth information. In: Proceedings of International Conference on Learning Representations (2013)

    Google Scholar 

  10. Song, S., Lichtenberg, S.P., Xiao, J.: Sun RGB-D: a RGB-D scene understanding benchmark suite. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 567–576 (2015)

    Google Scholar 

  11. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of International Conference on Learning Representations (2015)

    Google Scholar 

  12. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). doi:10.1007/978-3-319-10590-1_53

    Google Scholar 

  13. Badrinarayanan, V., Handa, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293 (2015)

  14. Kendall, A., Badrinarayanan, V., Cipolla, R.: Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680 (2015)

  15. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)

    MathSciNet  MATH  Google Scholar 

  16. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. Computing Research Repository (2015)

    Google Scholar 

  17. Li, L.Z., Yukang, G., Xiaodan, L., Yizhou, Y., Hui, C., Liang, L.: RGB-D Scene labeling with long short-term memorized fusion model. arXiv preprint arXiv:1604.05000v2 (2016)

  18. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Bach, F.R., Blei, D.M. (eds.): Proceedings of International Conference on Machine Learning, JMLR Proceedings, vol. 37, pp. 448–456. JMLR.org (2015)

    Google Scholar 

  19. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of IEEE International Conference on Computer Vision, pp. 2650–2658 (2015)

    Google Scholar 

  20. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)

  21. Bottou, Léon: Stochastic gradient descent tricks. In: Montavon, Grégoire, Orr, Geneviève, B., Müller, Klaus-Robert (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 421–436. Springer, Heidelberg (2012). doi:10.1007/978-3-642-35289-8_25

    Chapter  Google Scholar 

  22. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015)

    Article  MathSciNet  Google Scholar 

Download references

Acknowledgement

This work was partially supported by the ERC Consolidator Grant “3D Reloaded” and by the Alexander von Humboldt Foundation.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Caner Hazirbas .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Hazirbas, C., Ma, L., Domokos, C., Cremers, D. (2017). FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture. In: Lai, SH., Lepetit, V., Nishino, K., Sato, Y. (eds) Computer Vision – ACCV 2016. ACCV 2016. Lecture Notes in Computer Science(), vol 10111. Springer, Cham. https://doi.org/10.1007/978-3-319-54181-5_14

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-54181-5_14

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54180-8

  • Online ISBN: 978-3-319-54181-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics