Abstract
Prototype learning is widely used for one-shot segmentation. However, representing all of the information in a support image with a single RGB prototype can lead to ambiguity. To address this, we propose a one-shot segmentation network based on multi-modal prototype learning that complements RGB information with depth information. Specifically, we propose a multi-modal fusion and refinement block (MFRB) and a multi-modal prototype learning block (MPLB). MFRB fuses RGB and depth features to generate multi-modal features and refined depth features, which MPLB then uses to generate multi-modal information prototypes, depth information prototypes, and global information prototypes. Furthermore, we introduce self-attention to capture global context in the RGB and depth images. By integrating self-attention, MFRB, and MPLB, we propose the multi-modal prototype learning network (MMPL-Net), which is robust to ambiguous visual information in the scene. Finally, we construct a one-shot RGB-D segmentation dataset called OSS-RGB-D-5\(^i\). Experiments on OSS-RGB-D-5\(^i\) show that our method outperforms several state-of-the-art techniques while using fewer labeled images, and generalizes well to previously unseen objects.
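The abstract does not specify the internals of MFRB or MPLB, but the general idea of multi-modal prototype learning can be illustrated with a short sketch. The PyTorch-style code below computes multi-modal, depth, and global prototypes from support features via masked average pooling, a common mechanism in prototype-based few-shot segmentation. The concatenation fusion and all function names here are illustrative assumptions, not the paper's actual modules.

```python
# Minimal sketch (assumed, not the paper's MFRB/MPLB): prototypes via masked
# average pooling over fused RGB-D support features.
import torch
import torch.nn.functional as F


def masked_average_pooling(feat, mask):
    """Average a feature map over the support mask to get one prototype vector.

    feat: (B, C, H, W) feature map; mask: (B, 1, H, W) float foreground mask.
    Returns a (B, C) prototype.
    """
    # Resize the mask to the spatial size of the feature map.
    mask = F.interpolate(mask, size=feat.shape[-2:],
                         mode="bilinear", align_corners=False)
    # Weighted average over foreground pixels (epsilon avoids division by zero).
    return (feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)


def multimodal_prototypes(rgb_feat, depth_feat, support_mask):
    """Fuse RGB and depth features, then pool three kinds of prototypes:
    multi-modal (fused, masked), depth-only (masked), and global (unmasked)."""
    # Naive channel concatenation stands in for the learned fusion of MFRB.
    fused = torch.cat([rgb_feat, depth_feat], dim=1)
    p_multimodal = masked_average_pooling(fused, support_mask)
    p_depth = masked_average_pooling(depth_feat, support_mask)
    p_global = fused.mean(dim=(2, 3))  # whole-image context, no mask
    return p_multimodal, p_depth, p_global
```

At inference time, prototypes of this kind are typically compared with query features (e.g., via cosine similarity) to produce a per-pixel segmentation score map.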
Data availability
The data that support the findings of this study are available from the corresponding author, [YZ], upon reasonable request.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 61973066), the Major Science and Technology Projects of Liaoning Province (No. 2021JH1/10400049), the Foundation of the Key Laboratory of Equipment Reliability (No. D2C20205500306), and the Foundation of the Key Laboratory of Aerospace System Simulation (No. 6142002200301).
Ethics declarations
Conflict of interest
The authors declare that they do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Shan, D., Zhang, Y., Liu, X. et al. MMPL-Net: multi-modal prototype learning for one-shot RGB-D segmentation. Neural Comput & Applic 35, 10297–10310 (2023). https://doi.org/10.1007/s00521-023-08235-3