MSSD: multi-scale object detector based on spatial pyramid depthwise convolution and efficient channel attention mechanism

Zhou, Yipeng; Qian, Huaming; Ding, Peng

doi:10.1007/s11554-023-01358-9

Research
Published: 01 September 2023

Volume 20, article number 103 (2023)

Journal of Real-Time Image Processing

Yipeng Zhou¹,
Huaming Qian¹ &
Peng Ding¹

Abstract

Object detection has advanced rapidly across many fields, but in complex application scenarios target features are often inconspicuous and the range of object scales is large, so detectors fall short of the desired results, especially for small targets. This paper proposes MSSD, a multi-scale object detector that improves on SSD using spatial pyramid depthwise convolution (SPDC) and an efficient channel attention mechanism (ECAM). First, ResNet50 replaces VGG as the backbone to obtain more representative features. Second, a plug-and-play spatial pyramid depthwise convolution module (SPDC) is proposed to enlarge the receptive field and strengthen multi-scale feature extraction. Third, we design a straightforward efficient channel attention mechanism (ECAM) that rescales feature weights across channels to derive more robust features. Finally, a feature pyramid network (FPN) with ECAM (ECAM-FPN) is introduced in the prediction feature layer for deep feature fusion, yielding multi-scale features rich in both semantic and detail information. For 300×300 input, MSSD achieves 82.5% mAP on the PASCAL VOC07+12 dataset at 56 FPS and 48.2% mAP on the MS COCO2017 dataset, which are 8.2% and 7.0% higher than SSD(300), respectively. Detection of small targets improves by 0.8% on COCO, and by 6.5% when the input is scaled to 512×512. The proposed method delivers significant gains in cross-scale target detection while meeting real-time requirements, and is comparable with other methods.
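The abstract does not spell out how ECAM computes its channel weights. As a rough illustration of the general idea of scaling features per channel, the sketch below follows the ECA-Net pattern (global average pooling, a small 1D convolution across the channel axis, then a sigmoid gate). It is an assumption for illustration, not the authors' ECAM: the function names, the toy feature map, and the kernel values are all hypothetical, and in a real network the kernel would be learned.

```python
import math


def global_average_pool(feature_map):
    """Reduce a C x H x W nested list to one mean value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def conv1d_same(values, kernel):
    """1D cross-correlation along the channel axis with zero padding.

    Assumes an odd kernel length so the output has one value per channel.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(values) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(values))
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feature_map, kernel):
    """ECA-style gate: pool, mix neighbouring channels, squash, rescale."""
    pooled = global_average_pool(feature_map)
    weights = [sigmoid(v) for v in conv1d_same(pooled, kernel)]
    scaled = [
        [[w * x for x in row] for row in channel]
        for channel, w in zip(feature_map, weights)
    ]
    return scaled, weights


# Toy 3-channel, 2x2 feature map (hypothetical values).
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
# Hypothetical kernel; [0, 1, 0] makes each weight depend only on its own channel.
kernel = [0.0, 1.0, 0.0]

scaled, weights = channel_attention(features, kernel)
```

Each channel is multiplied by a single weight between 0 and 1, so the spatial layout of the features is unchanged while channels with a stronger pooled response are kept closer to their original magnitude. A kernel with non-zero side taps would let neighbouring channels influence each other's weight.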

Data availability

The corresponding author will supply the relevant data in response to reasonable requests.

Acknowledgements

This research is partially supported by the Key-Area Research and Development Program of Guangdong Province under Grant 2020B0909020001 and the National Natural Science Foundation of China under Grant No. 61573113.

Funding

This article is funded by the Key-Area Research and Development Program of Guangdong Province and the National Natural Science Foundation of China.

Author information

Authors and Affiliations

College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin, 150001, China
Yipeng Zhou, Huaming Qian & Peng Ding

Corresponding author

Correspondence to Huaming Qian.

Ethics declarations

Conflict of interest

No conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhou, Y., Qian, H. & Ding, P. MSSD: multi-scale object detector based on spatial pyramid depthwise convolution and efficient channel attention mechanism. J Real-Time Image Proc 20, 103 (2023). https://doi.org/10.1007/s11554-023-01358-9

Received: 06 May 2023
Accepted: 18 August 2023
Published: 01 September 2023
DOI: https://doi.org/10.1007/s11554-023-01358-9
