Multimodal Neural Networks: RGB-D for Semantic Segmentation and Object Detection

  • Lukas Schneider
  • Manuel Jasch
  • Björn Fröhlich
  • Thomas Weber
  • Uwe Franke
  • Marc Pollefeys
  • Matthias Rätsch
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10269)


This paper presents a novel multi-modal CNN architecture that exploits complementary input cues in addition to sole color information. The joint model implements a mid-level fusion that allows the network to exploit cross-modal interdependencies already at a medium feature level. The benefit of the presented architecture is shown for the RGB-D image understanding task. So far, state-of-the-art RGB-D CNNs have used network weights trained on color data. In contrast, a superior initialization scheme is proposed to pre-train the depth branch of the multi-modal CNN independently. In an end-to-end training, the network parameters are then optimized jointly on the challenging Cityscapes dataset. Thorough experiments demonstrate the effectiveness of the proposed model: both the RGB GoogLeNet and further RGB-D baselines are outperformed by a significant margin on two different tasks, semantic segmentation and object detection. For the latter, this paper shows how to extract object-level ground truth from the instance-level annotations in Cityscapes in order to train a powerful object detector.
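The mid-level fusion idea described above can be sketched in a few lines: each modality (color and depth) is processed by its own branch up to a medium feature level, and the resulting feature maps are concatenated along the channel axis before a joint trunk continues. The sketch below is a minimal NumPy illustration under assumed shapes and a stand-in 1x1-convolution branch; it is not the paper's actual GoogLeNet-based architecture.

```python
import numpy as np

def branch_features(x, out_channels, seed):
    """Stand-in for a CNN branch: a fixed random 1x1 'convolution' + ReLU.

    A 1x1 convolution is a channel-wise linear map applied at every
    spatial location, which einsum expresses directly.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((out_channels, x.shape[0]))
    return np.maximum(np.einsum('oc,chw->ohw', w, x), 0.0)

# Illustrative inputs in C x H x W layout (shapes are assumptions)
rgb = np.ones((3, 32, 64))    # three-channel color image
depth = np.ones((1, 32, 64))  # single-channel depth map

# Each modality gets its own branch, mirroring the paper's separately
# pre-trained depth branch alongside the color branch
f_rgb = branch_features(rgb, 64, seed=0)
f_depth = branch_features(depth, 64, seed=1)

# Mid-level fusion: concatenate the two feature maps along the channel
# axis; a joint trunk would process the fused tensor from here on
fused = np.concatenate([f_rgb, f_depth], axis=0)
print(fused.shape)  # (128, 32, 64)
```

Concatenation at a medium depth is what distinguishes this scheme from early fusion (stacking raw RGB and depth channels at the input) and late fusion (merging only the branches' final predictions).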


Keywords: Convolutional Neural Network · Depth Data · Late Fusion · Early Fusion · Convolutional Layer



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Lukas Schneider (1, 2)
  • Manuel Jasch (3)
  • Björn Fröhlich (1)
  • Thomas Weber (3)
  • Uwe Franke (1)
  • Marc Pollefeys (2, 4)
  • Matthias Rätsch (3)
  1. Daimler AG, Stuttgart, Germany
  2. ETH Zurich, Zurich, Switzerland
  3. Reutlingen University, Reutlingen, Germany
  4. Microsoft Corporation, Seattle, USA
