
MMPL-Net: multi-modal prototype learning for one-shot RGB-D segmentation

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

Prototype learning is widely used for one-shot segmentation. However, representing all of the information in the support image with a single RGB prototype may lead to ambiguity. To address this, we propose a one-shot segmentation network based on multi-modal prototype learning that uses depth information to complement RGB information. Specifically, we propose a multi-modal fusion and refinement block (MFRB) and a multi-modal prototype learning block (MPLB). MFRB fuses RGB and depth features to generate multi-modal features and refined depth features, which MPLB then uses to generate multi-modal information prototypes, depth information prototypes, and global information prototypes. Furthermore, we introduce self-attention to capture global context information in the RGB and depth images. By integrating self-attention, MFRB, and MPLB, we propose the multi-modal prototype learning network (MMPL-Net), which adapts to the ambiguity of visual information in the scene. Finally, we construct a one-shot RGB-D segmentation dataset called OSS-RGB-D-5\(^i\). Experiments on OSS-RGB-D-5\(^i\) show that the proposed method outperforms several state-of-the-art techniques with fewer labeled images and generalizes well to previously unseen objects.
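The core operation described in the abstract, pooling masked support features from each modality into class prototypes and matching them against query features, can be sketched roughly as follows. This is a minimal illustration under our own assumptions (masked average pooling and cosine-similarity matching, with a simple additive stand-in for MFRB fusion); the names masked_avg_pool and match_query are hypothetical and do not reflect the authors' implementation.

    # Hypothetical sketch of multi-modal prototype matching for one-shot segmentation.
    # Assumes RGB and depth support/query features have already been extracted.
    import torch
    import torch.nn.functional as F

    def masked_avg_pool(feat, mask):
        """Average support features over the foreground mask to obtain one prototype.
        feat: (B, C, H, W) support features; mask: (B, 1, H, W) binary support mask."""
        mask = F.interpolate(mask, size=feat.shape[-2:], mode="bilinear", align_corners=False)
        proto = (feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-5)
        return proto  # (B, C)

    def match_query(query_feat, prototype):
        """Cosine similarity between every query location and a prototype."""
        proto = prototype[..., None, None]                   # (B, C, 1, 1)
        return F.cosine_similarity(query_feat, proto, dim=1) # (B, H, W)

    # Toy example: fused (multi-modal), depth, and RGB support features each yield a
    # prototype; the resulting similarity maps are combined into a query prediction.
    B, C, H, W = 1, 256, 32, 32
    rgb_feat, depth_feat = torch.randn(B, C, H, W), torch.randn(B, C, H, W)
    fused_feat = rgb_feat + depth_feat                 # placeholder for the paper's MFRB fusion
    support_mask = torch.randint(0, 2, (B, 1, H, W)).float()

    protos = [masked_avg_pool(f, support_mask) for f in (fused_feat, depth_feat, rgb_feat)]
    query_feat = torch.randn(B, C, H, W)
    sim_maps = torch.stack([match_query(query_feat, p) for p in protos], dim=1)  # (B, 3, H, W)
    pred = sim_maps.mean(dim=1)                        # simple fusion of the similarity maps

In the paper, the individual prototypes and their combination are produced by MFRB and MPLB rather than by the fixed pooling and averaging used above; the sketch only conveys why multiple modality-specific prototypes reduce the ambiguity of a single RGB prototype.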


Data availability

The data that support the findings of this study are available from the corresponding author [YZ], upon reasonable request.


Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61973066), the Major Science and Technology Projects of Liaoning Province (No. 2021JH1/10400049), the Foundation of the Key Laboratory of Equipment Reliability (No. D2C20205500306), and the Foundation of the Key Laboratory of Aerospace System Simulation (No. 6142002200301).

Author information

Corresponding author

Correspondence to Yunzhou Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no commercial or associative interests that represent a conflict of interest in connection with the submitted work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Shan, D., Zhang, Y., Liu, X. et al. MMPL-Net: multi-modal prototype learning for one-shot RGB-D segmentation. Neural Comput & Applic 35, 10297–10310 (2023). https://doi.org/10.1007/s00521-023-08235-3

