Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12356)


Depth information has proven to be a useful cue in the semantic segmentation of RGB-D images for providing a geometric counterpart to the RGB representation. Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and models the problem as a cross-modal feature fusion to obtain better feature representations to achieve more accurate segmentation. This, however, may not lead to satisfactory results as actual depth data are generally noisy, which might worsen the accuracy as the networks go deeper.

In this paper, we propose a unified and efficient Cross-modality Guided Encoder to not only effectively recalibrate RGB feature responses, but also to distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternatively. The key of the proposed architecture is a novel Separation-and-Aggregation Gating operation that jointly filters and recalibrates both representations before cross-modality aggregation. Meanwhile, a Bi-direction Multi-step Propagation strategy is introduced, on the one hand, to help to propagate and fuse information between the two modalities, and on the other hand, to preserve their specificity along the long-term propagation process. Besides, our proposed encoder can be easily injected into the previous encoder-decoder structures to boost their performance on RGB-D semantic segmentation. Our model outperforms state-of-the-arts consistently on both in-door and out-door challenging datasets (Code of this work is available at


RGB-D semantic segmentation Cross-modality feature propagation 



This work is supported by the National Key Research and Development Program of China (2017YFB1002601, 2016QY02D0304), National Natural Science Foundation of China (61375022, 61403005, 61632003), Beijing Advanced Innovation Center for Intelligent Robots and Systems (2018IRS11), and PEK-SenseTime Joint Laboratory of Machine Vision.

Supplementary material

504452_1_En_33_MOESM1_ESM.pdf (19.4 mb)
Supplementary material 1 (pdf 19877 KB)


  1. 1.
    Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
  2. 2.
    Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 833–851. Springer, Cham (2018). Scholar
  3. 3.
    Chen, X., Lin, K.Y., Qian, C., Zeng, G., Li, H.: 3D sketch-aware semantic scene completion via semi-supervised structure prior. In: CVPR (2020)Google Scholar
  4. 4.
    Chen, Y., Mensink, T., Gavves, E.: 3D neighborhood convolution: learning depth-aware features for RGB-D and RGB semantic segmentation. In: 3DV. IEEE (2019)Google Scholar
  5. 5.
    Cheng, B., et al.: SPGNet: semantic prediction guidance for scene parsing. In: ICCV (2019)Google Scholar
  6. 6.
    Cheng, Y., Cai, R., Li, Z., Zhao, X., Huang, K.: Locality-sensitive deconvolution networks with gated fusion for RGB-D indoor semantic segmentation. In: CVPR (2017)Google Scholar
  7. 7.
    Choi, S., Kim, J.T., Choo, J.: Cars can’t fly up in the sky: improving urban-scene segmentation via height-driven attention networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)Google Scholar
  8. 8.
    Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)Google Scholar
  9. 9.
    Deng, L., Yang, M., Li, T., He, Y., Wang, C.: RFBNet: deep multimodal networks with residual fusion blocks for RGB-D semantic segmentation. arXiv preprint arXiv:1907.00135 (2019)
  10. 10.
    Ding, H., Jiang, X., Liu, A., Thalmann, N.M., Wang, G.: Boundary-aware feature propagation for scene segmentation. In: ICCV (2019)Google Scholar
  11. 11.
    Fu, J., et al.: Dual attention network for scene segmentation. In: CVPR (2019)Google Scholar
  12. 12.
    Fu, J., et al.: Adaptive context network for scene parsing. In: ICCV (2019)Google Scholar
  13. 13.
    Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 345–360. Springer, Cham (2014). Scholar
  14. 14.
    He, J., Deng, Z., Qiao, Y.: Dynamic multi-scale filters for semantic segmentation. In: CVPR (2019)Google Scholar
  15. 15.
    He, Y., Chiu, W.C., Keuper, M., Fritz, M.: STD2P: RGBD semantic segmentation using spatio-temporal data-driven pooling. In: ICCV (2017)Google Scholar
  16. 16.
    Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: CVPR (2018)Google Scholar
  17. 17.
    Hu, X., Yang, K., Fei, L., Wang, K.: ACNet: attention based network to exploit complementary features for RGBD semantic segmentation. arXiv preprint arXiv:1905.10089 (2019)
  18. 18.
    Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: CCNet: criss-cross attention for semantic segmentation. In: ICCV (2019)Google Scholar
  19. 19.
    Hung, S.W., Lo, S.Y., Hang, H.M.: Incorporating luminance, depth and color information by a fusion-based network for semantic segmentation. In: ICIP. IEEE (2019)Google Scholar
  20. 20.
    Kong, S., Fowlkes, C.C.: Recurrent scene parsing with perspective understanding in the loop. In: CVPR (2018)Google Scholar
  21. 21.
    Li, X., Wang, W., Hu, X., Yang, J.: Selective kernel networks. In: CVPR (2019)Google Scholar
  22. 22.
    Li, Z., Gan, Y., Liang, X., Yu, Y., Cheng, H., Lin, L.: LSTM-CF: unifying context modeling and fusion with LSTMs for RGB-D scene labeling. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 541–557. Springer, Cham (2016). Scholar
  23. 23.
    Lin, D., Chen, G., Cohen-Or, D., Heng, P.A., Huang, H.: Cascaded feature network for semantic segmentation of RGB-D images. In: ICCV (2017)Google Scholar
  24. 24.
    Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)Google Scholar
  25. 25.
    Park, S.J., Hong, K.S., Lee, S.: RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation. In: ICCV (2017)Google Scholar
  26. 26.
    Qi, X., Liao, R., Jia, J., Fidler, S., Urtasun, R.: 3D graph neural networks for RGBD semantic segmentation. In: ICCV (2017)Google Scholar
  27. 27.
    Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). Scholar
  28. 28.
    Takikawa, T., Acuna, D., Jampani, V., Fidler, S.: Gated-SCNN: gated shape CNNs for semantic segmentation (2019)Google Scholar
  29. 29.
    Wang, F., et al.: Residual attention network for image classification. In: CVPR (2017)Google Scholar
  30. 30.
    Wang, J., Wang, Z., Tao, D., See, S., Wang, G.: Learning common and specific features for RGB-D semantic segmentation with deconvolutional networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 664–679. Springer, Cham (2016). Scholar
  31. 31.
    Wang, W., Neumann, U.: Depth-aware CNN for RGB-D segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 144–161. Springer, Cham (2018). Scholar
  32. 32.
    Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR (2018)Google Scholar
  33. 33.
    Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 3–19. Springer, Cham (2018). Scholar
  34. 34.
    Wu, H., Zhang, J., Huang, K., Liang, K., Yu, Y.: FastFCN: rethinking dilated convolution in the backbone for semantic segmentation. arXiv preprint arXiv:1903.11816 (2019)
  35. 35.
    Xing, Y., Wang, J., Chen, X., Zeng, G.: 2.5D convolution for RGB-D semantic segmentation. In: ICIP. IEEE (2019)Google Scholar
  36. 36.
    Xing, Y., Wang, J., Chen, X., Zeng, G.: Coupling two-stream RGB-D semantic segmentation network by idempotent mappings. In: ICIP. IEEE (2019)Google Scholar
  37. 37.
    Xu, D., Ouyang, W., Wang, X., Sebe, N.: PAD-Net: multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In: CVPR (2018)Google Scholar
  38. 38.
    Yang, M., Yu, K., Zhang, C., Li, Z., Yang, K.: DenseASPP for semantic segmentation in street scenes. In: CVPR (2018)Google Scholar
  39. 39.
    Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: Learning a discriminative feature network for semantic segmentation. In: CVPR (2018)Google Scholar
  40. 40.
    Yuan, Y., Wang, J.: OCNet: object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018)
  41. 41.
    Zhang, F., et al.: ACFNet: attentional class feature network for semantic segmentation. In: ICCV (2019)Google Scholar
  42. 42.
    Zhang, Z., Cui, Z., Xu, C., Yan, Y., Sebe, N., Yang, J.: Pattern-affinitive propagation across depth, surface normal and semantic segmentation. In: CVPR (2019)Google Scholar
  43. 43.
    Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. 1.Key Laboratory of Machine Perception (MOE), School of EECSPeking UniversityBeijingChina
  2. 2.SenseTime ResearchTai PoHong Kong
  3. 3.The Chinese University of Hong KongSha TinHong Kong

Personalised recommendations