Abstract
Rational use of multilevel structures of deep networks to extract multiscale features is crucial for instance segmentation. The Feature Pyramid Network (FPN) is a classical architecture that enriches the semantic information of multiscale objects. However, inherent defects in FPN structure are bound to cause loss of information during feature extraction and feature fusion. In this paper, we propose a feature pyramid structure (called Refine-FPN) based on a non-local multi-feature aggregation operation, a module that integrates multi-scale feature to rely on attention mechanisms to improve pyramid feature representation. The algorithm enriches the feature details of feature layers by aggregating multiple features to form a contextual global feature representation. By replacing FPN with Refine-FPN in the Mask R-CNN, our model improved the performance of the mask AP by 0.6% and 0.5% on the COCO dataset, when using ResNet-50 and ResNet-101 as the backbone, respectively. Moreover, it is friendly to integrate the proposed method into other popular architectures. For example, equipping the Cascade Mask R-CNN with Refine-FPN achieves an improvement of 0.5% and 0.4% mask AP under ResNet-50 and ResNet-101, respectively.
Similar content being viewed by others
Notes
To simplify the analysis, we omit the dimension in batch direction here.
References
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Hong C, Yu J, Zhang J et al (2018) Multimodal face-pose estimation with multitask manifold deep learning. IEEE Trans Industr Inf 15(7):3952–3961
Yu J, Tan M, Zhang H et al (2019) Hierarchical deep click feature prediction for fine-grained image recognition. IEEE Trans Pattern Anal Mach Intell 44(2):563–578
Liu S, Qi L, Qin H et al (2018) Path aggregation network for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 8759–8768
Chen K, Pang J, Wang J et al (2019) Hybrid task cascade for instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 4974–4983
Chen H, Sun K, Tian Z et al (2020) Blendmask: Top-down meets bottom-up for instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 8573–8581
Yu J, Yao J, Zhang J et al (2020) SPRNet: single-pixel reconstruction for one-stage instance segmentation. IEEE Trans Cybern 51(4):1731–1742
Zhang J, Cao Y, Wu Q (2021) Vector of locally and adaptively aggregated descriptors for image feature representation. Pattern Recogn 116:107952
Zhang J, Yang J, Yu J et al (2022) Semisupervised image classification by mutual learning of multiple self-supervised models. Int J Intell Syst 37(5):3117–3141
He K, Gkioxari G, Dollár P et al (2017) Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp 2961–2969
Lin T Y, Dollár P, Girshick R et al (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 2117–2125
Lin T Y, Maire M, Belongie S et al (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, Cham, pp 740–755
Redmon J, Farhadi A (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767
Lin T Y, Goyal P, Girshick R et al (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp 2980–2988
Ren S, He K, Girshick R et al (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Adv Neural Inform Process Syst 28
Cai Z, Vasconcelos N (2018) Cascade r-cnn: delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 6154–6162
Fang Y, Yang S, Wang X et al (2021) Instances as queries. In: Proceedings of the IEEE/CVF international conference on computer vision. pp 6910–6919
O Pinheiro PO, Collobert R, Dollár P (2015) Learning to segment object candidates. Adv Neural Inform Process Syst 28
Pinheiro PO, Lin TY, Collobert R et al (2016) Learning to refine object segments. In: European conference on computer vision. Springer, Cham, pp 75–91
Zagoruyko S, Lerer A, Lin T Y et al (2016) A multipath network for object detection. arXiv preprint arXiv:1604.02135
Dai J, He K, Sun J (2016) Instance-aware semantic segmentation via multi-task network cascades. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 3150–3158
Li Y, Qi H, Dai J et al (2017) Fully convolutional instance-aware semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 2359–2367
Dai J, He K, Li Y et al (2016) Instance-sensitive fully convolutional networks. In: European conference on computer vision. Springer, Cham, pp 534–549
Chen LC, Hermans A, Papandreou G et al (2018) Masklab: instance segmentation by refining object detection with semantic and direction features. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 4013–4022
Kirillov A, Levinkov E, Andres B et al (2017) Instancecut: from edges to instances with multicut. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 5008–5017
Liu S, Jia J, Fidler S et al (2017) Sgn: sequential grouping networks for instance segmentation. In: Proceedings of the IEEE international conference on computer vision. pp. 3496–3504
Uhrig J, Cordts M, Franke U et al (2016) Pixel-level encoding and depth layering for instance-level semantic labeling. In: German conference on pattern recognition. Springer, Cham, pp 14–25
De Brabandere B, Neven D, Van Gool L (2017) Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551
Newell A, Huang Z, Deng J (2017) Associative embedding: end-to-end learning for joint detection and grouping. Adv Neural Inform Process Syst 30
Fathi A, Wojna Z, Rathod V et al (2017) Semantic instance segmentation via deep metric learning. arXiv preprint arXiv:1703.10277
Liu W, Anguelov D, Erhan D et al (2016) Ssd: single shot multibox detector. In: European conference on computer vision. Springer, Cham, pp 21–37
Tan M, Pang R, Le QV (2020) Efficientdet: scalable and efficient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 10781–10790
Ghiasi G, Lin T Y, Le QV (2019) Nas-fpn: learning scalable feature pyramid architecture for object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 7036–7045
Guo C, Fan B, Zhang Q et al (2020) Augfpn: improving multi-scale feature learning for object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 12595–12604
Qiao S, Chen LC, Yuille A (2021) Detectors: detecting objects with recursive feature pyramid and switchable atrous convolution. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 10213–10224
Hu M, Li Y, Fang L et al (2021) A2-FPN: attention aggregation based feature pyramid network for instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 15343–15352
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. Adv Neural Inform Process Systems 30
Wang X, Girshick R, Gupta A et al (2018) Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 7794–7803
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 7132–7141
Fu J, Liu J, Tian H et al (2019) Dual attention network for scene segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 3146–3154
Huang Z, Wang X, Huang L et al (2019) Ccnet: criss-cross attention for semantic segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp 603–612
Cao Y, Xu J, Lin S et al (2019) Gcnet: non-local networks meet squeeze-excitation networks and beyond. In: Proceedings of the IEEE/CVF international conference on computer vision workshops
He K, Zhang X, Ren S et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 770–778
Gupta A, Dollar P, Girshick R (2019) Lvis: a dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 5356–5364
Chen K, Wang J, Pang J et al (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155
Xie S, Girshick R, Dollár P et al (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 1492–1500
Zhao H, Shi J, Qi X et al (2017) Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 2881–2890
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, X., Zhu, L., Wang, W. et al. Refine-FPN: Instance Segmentation Based on a Non-local Multi-feature Aggregation Mechanism. Neural Process Lett 55, 3411–3428 (2023). https://doi.org/10.1007/s11063-022-11016-z
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11063-022-11016-z