
MSSD: multi-scale object detector based on spatial pyramid depthwise convolution and efficient channel attention mechanism

  • Research
  • Published in: Journal of Real-Time Image Processing

Abstract

Object detection has seen wide adoption and remarkable progress in many fields, but in complex application scenarios it often encounters inconspicuous target features and a wide range of object scales, which prevent it from achieving the desired results, especially for small targets. This paper proposes MSSD, a multi-scale object detector based on spatial pyramid depthwise convolution (SPDC) and an efficient channel attention mechanism (ECAM), built by optimizing SSD. First, ResNet50 replaces VGG as the backbone to obtain more representative features. Second, a plug-and-play spatial pyramid depthwise convolution (SPDC) module is proposed to enlarge the receptive field and strengthen multi-scale feature extraction. Furthermore, we design a straightforward efficient channel attention mechanism (ECAM) that rescales channel-wise feature weights to derive more robust features. Finally, a feature pyramid network with ECAM (ECAM-FPN) is introduced on the prediction feature layers for deep feature fusion, yielding multi-scale features rich in semantic and detail information. For 300×300 input, MSSD achieves 82.5% mAP on the PASCAL VOC07+12 dataset at 56 FPS and 48.2% mAP on the MS COCO2017 dataset, which are 8.2% and 7.0% higher than SSD(300), respectively. Small-target detection improves by 0.8% on COCO and by 6.5% when the input is scaled to 512×512. The proposed method delivers significant gains in cross-scale target detection while remaining real-time, and is competitive with other methods.
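
The abstract only names the two building blocks, so the following is a minimal PyTorch sketch of how such modules are commonly constructed, not the paper's exact design: the ECAM part follows the well-known ECA-Net recipe (global average pooling followed by a 1-D convolution across channels), while the SPDC branch count, dilation rates, kernel size, and residual fusion are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class SPDC(nn.Module):
    """Sketch of a spatial pyramid depthwise convolution block: parallel
    depthwise 3x3 convolutions with different dilation rates (assumed here)
    form a pyramid of receptive fields; a pointwise convolution fuses the
    branches and a residual connection keeps the block plug-and-play."""

    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d,
                      dilation=d, groups=channels, bias=False)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(channels * len(dilations), channels,
                              kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Concatenate the multi-receptive-field depthwise responses,
        # fuse them back to the input width, and add the shortcut.
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(self.bn(self.fuse(y)) + x)


class ECAM(nn.Module):
    """Sketch of an efficient channel attention module in the ECA-Net style:
    global average pooling plus a 1-D convolution across channels produces
    per-channel weights that rescale the feature map."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # (N, C, H, W) -> (N, C, 1, 1) -> (N, 1, C) for the 1-D convolution.
        y = self.pool(x).squeeze(-1).transpose(-1, -2)
        y = self.sigmoid(self.conv(y)).transpose(-1, -2).unsqueeze(-1)
        return x * y  # channel-wise reweighting


if __name__ == "__main__":
    feats = torch.randn(1, 256, 38, 38)   # e.g. an SSD-style prediction layer
    out = ECAM()(SPDC(256)(feats))
    print(out.shape)                      # torch.Size([1, 256, 38, 38])
```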

Data availability

The relevant data are available from the corresponding author upon reasonable request.


Acknowledgements

This research was partially supported by the Key-Area Research and Development Program of Guangdong Province under Grant 2020B0909020001 and the National Natural Science Foundation of China under Grant No. 61573113.

Funding

This work was funded by the Key-Area Research and Development Program of Guangdong Province and the National Natural Science Foundation of China.

Author information


Corresponding author

Correspondence to Huaming Qian.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhou, Y., Qian, H. & Ding, P. MSSD: multi-scale object detector based on spatial pyramid depthwise convolution and efficient channel attention mechanism. J Real-Time Image Proc 20, 103 (2023). https://doi.org/10.1007/s11554-023-01358-9
