
Self-Supervised Model Adaptation for Multimodal Semantic Segmentation

Published in: International Journal of Computer Vision

Abstract

Learning to reliably perceive and understand the scene is an integral enabler for robots to operate in the real world. This problem is inherently challenging due to the multitude of object types as well as appearance changes caused by varying illumination and weather conditions. Leveraging complementary modalities can enable learning of semantically richer representations that are resilient to such perturbations. Despite the tremendous progress in recent years, most multimodal convolutional neural network approaches directly concatenate feature maps from individual modality streams, rendering the model incapable of focusing only on the relevant complementary information for fusion. To address this limitation, we propose a multimodal semantic segmentation framework that dynamically adapts the fusion of modality-specific features while being sensitive to the object category, spatial location and scene context in a self-supervised manner. Specifically, we propose an architecture consisting of two modality-specific encoder streams that fuse intermediate encoder representations into a single decoder using our proposed self-supervised model adaptation (SSMA) fusion mechanism, which optimally combines complementary features. As intermediate representations are not aligned across modalities, we introduce an attention scheme for better correlation. In addition, we propose a computationally efficient unimodal segmentation architecture termed AdapNet++ that incorporates a new encoder with multiscale residual units and an efficient atrous spatial pyramid pooling that has a larger effective receptive field with more than 10× fewer parameters, complemented with a strong decoder with a multi-resolution supervision scheme that recovers high-resolution details. Comprehensive empirical evaluations on the Cityscapes, Synthia, SUN RGB-D, ScanNet and Freiburg Forest benchmarks demonstrate that both our unimodal and multimodal architectures achieve state-of-the-art performance while simultaneously being efficient in terms of parameters and inference time, and demonstrating substantial robustness in adverse perceptual conditions.
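
The SSMA fusion mechanism described in the abstract can be pictured as a small gating network: the two modality streams are concatenated, a bottleneck predicts per-channel, per-location weights that recalibrate the stacked features, and a final convolution projects the result back to a single fused map. Below is a minimal PyTorch sketch of such a fusion block; the layer widths, the reduction factor and the class name SSMAFusion are illustrative assumptions for exposition, not the authors' released implementation.

    # Minimal sketch of an SSMA-style fusion block (hypothetical
    # re-implementation; sizes and names are assumptions, not the
    # paper's released code).
    import torch
    import torch.nn as nn

    class SSMAFusion(nn.Module):
        """Fuses two modality-specific feature maps by predicting
        spatially varying re-weighting factors from their concatenation."""

        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            # Bottleneck that observes both modalities jointly and emits
            # per-location, per-channel gating weights in [0, 1].
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, 2 * channels // reduction, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(2 * channels // reduction, 2 * channels, 3, padding=1),
                nn.Sigmoid(),
            )
            # Projects the re-weighted stack back to a single feature map.
            self.fuse = nn.Sequential(
                nn.Conv2d(2 * channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
            )

        def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
            stacked = torch.cat([x_a, x_b], dim=1)
            weighted = stacked * self.gate(stacked)  # emphasize informative features
            return self.fuse(weighted)

    # Usage: fuse two 512-channel encoder outputs at reduced resolution.
    ssma = SSMAFusion(channels=512)
    out = ssma(torch.randn(1, 512, 24, 48), torch.randn(1, 512, 24, 48))

Because the gate is conditioned on both modalities jointly, it can, for example, suppress unreliable RGB features in low-light scenes while passing through the corresponding depth or infrared features, which is the dynamic, context-sensitive fusion behavior the abstract describes.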

Author information

Corresponding author

Correspondence to Abhinav Valada.

Additional information

Communicated by Anelia Angelova, Gustavo Carneiro, Niko Sünderhauf, Jürgen Leitner.

About this article

Cite this article

Valada, A., Mohan, R. & Burgard, W. Self-Supervised Model Adaptation for Multimodal Semantic Segmentation. Int J Comput Vis 128, 1239–1285 (2020). https://doi.org/10.1007/s11263-019-01188-y
