Malleable 2.5D Convolution: Learning Receptive Fields Along the Depth-Axis for RGB-D Scene Parsing

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12364)


Depth data provide geometric information that can improve RGB-D scene parsing. Several recent works propose RGB-D convolution operators that construct receptive fields along the depth-axis to handle 3D neighborhood relations between pixels. However, these methods pre-define the depth receptive fields through hyperparameters, making them dependent on parameter selection. In this paper, we propose a novel operator called malleable 2.5D convolution to learn the receptive field along the depth-axis. A malleable 2.5D convolution has one or more 2D convolution kernels. Our method assigns each pixel to one of the kernels, or to none of them, according to their relative depth differences, and the assignment is formulated in a differentiable form so that it can be learned by gradient descent. The proposed operator runs on standard 2D feature maps and can be seamlessly incorporated into pre-trained CNNs. We conduct extensive experiments on two challenging RGB-D semantic segmentation datasets, NYUDv2 and Cityscapes, to validate the effectiveness and generalization ability of our method.
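
The abstract does not give the exact form of the differentiable assignment, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes a softmax over K kernels plus a "none" option, where each kernel has a hypothetical learnable depth offset (`centers`) and the names `soft_depth_assignment`, `none_logit`, and `temperature` are illustrative choices rather than anything taken from the paper.

```python
import numpy as np

def soft_depth_assignment(rel_depth, centers, none_logit=-1.0, temperature=1.0):
    """Softly assign each relative depth to one of K kernels or to 'none'.

    rel_depth: array of any shape; depth of a neighbor minus depth of the
               center pixel of the receptive field.
    centers:   shape (K,); hypothetical learnable depth offsets, one per kernel
               (an assumption of this sketch, not taken from the paper).
    Returns weights of shape rel_depth.shape + (K + 1,); the last slot is 'none'.
    """
    rel_depth = np.asarray(rel_depth, dtype=np.float64)
    centers = np.asarray(centers, dtype=np.float64)
    # A kernel's logit is highest when the relative depth is near its center.
    kernel_logits = -((rel_depth[..., None] - centers) ** 2) / temperature
    # Constant logit for the 'assigned to no kernel' option.
    none_logits = np.full(rel_depth.shape + (1,), none_logit / temperature)
    logits = np.concatenate([kernel_logits, none_logits], axis=-1)
    # Numerically stable softmax over the K + 1 options.
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

# Four neighbors at different relative depths, three kernels.
rel = np.array([-0.5, 0.0, 0.5, 3.0])
centers = np.array([-0.5, 0.0, 0.5])
weights = soft_depth_assignment(rel, centers)
print(weights.round(3))
```

Because every step is a smooth function of `centers`, an automatic differentiation framework could update those offsets by gradient descent, which is the property the abstract emphasizes. In this sketch, a neighbor whose relative depth is far from every center puts most of its weight on the "none" slot.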


RGB-D scene parsing, Geometry in CNN, Malleable 2.5D convolution



This work is supported by the National Key Research and Development Program of China (2017YFB1002601, 2016QY02D0304), National Natural Science Foundation of China (61375022, 61403005, 61632003), Beijing Advanced Innovation Center for Intelligent Robots and Systems (2018IRS11), and PEK-SenseTime Joint Laboratory of Machine Vision.




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Key Laboratory of Machine Perception, Peking University, Beijing, China
  2. The Chinese University of Hong Kong, Shatin, Hong Kong
