Abstract
Object detection has advanced rapidly and is widely used across many fields. In complex application scenarios, however, target features are often inconspicuous and object scales vary widely, so detectors fall short of the desired accuracy, especially on small targets. This paper proposes MSSD, a multi-scale object detector that improves SSD with spatial pyramid depthwise convolution (SPDC) and an efficient channel attention mechanism (ECAM). First, ResNet50 replaces VGG as the backbone to obtain more representative features. Second, a plug-and-play spatial pyramid depthwise convolution module (SPDC) is proposed to enlarge the receptive field and strengthen multi-scale feature extraction. Third, we design a simple efficient channel attention mechanism (ECAM) that reweights features along the channel dimension to derive more robust features. Finally, a feature pyramid network (FPN) with ECAM (ECAM-FPN) is introduced at the prediction feature layers for deep feature fusion, yielding multi-scale features rich in both semantic and detail information. For 300\(\times\)300 input, MSSD achieves 82.5\(\%\) mAP on the PASCAL VOC07+12 dataset at 56 FPS and 48.2\(\%\) mAP on the MS COCO2017 dataset, which are 8.2\(\%\) and 7.0\(\%\) higher than SSD(300), respectively. Detection of small targets improves by 0.8\(\%\) on COCO, and by 6.5\(\%\) when the input is scaled to 512\(\times\)512. The proposed method achieves significant gains in cross-scale target detection while running in real time, and is competitive with other methods.
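To make the idea of scaling feature weights per channel concrete, the following is a minimal pure-Python sketch of a generic channel attention step in the style of ECA-Net (Wang et al., 2020). It is not the authors' ECAM, whose exact design the abstract does not specify. The function name `channel_attention`, the nested-list tensor layout, and the fixed example kernels are all assumptions made for illustration; in a real network the kernel weights would be learned.

```python
import math


def channel_attention(feature_map, kernel):
    """Rescale each channel by a gate derived from its global average.

    feature_map: list of C channels, each a list of H rows of W floats.
    kernel: odd-length list of 1D convolution weights applied across channels
            (hypothetical fixed values here; learned in a real network).
    Returns (scaled_feature_map, gates).
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")

    # Squeeze: global average pooling yields one descriptor per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]

    # Local cross-channel interaction: zero-padded 1D convolution over channels.
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    mixed = [
        sum(w * padded[c + j] for j, w in enumerate(kernel))
        for c in range(len(pooled))
    ]

    # Gate: a sigmoid maps each mixed descriptor to a weight in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-m)) for m in mixed]

    # Excite: multiply every value in a channel by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


# Three 2x2 channels with global means 1.0, 0.0 and 2.0.
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]

# Identity kernel: each gate depends only on its own channel's mean.
scaled, gates = channel_attention(features, kernel=[0.0, 1.0, 0.0])

# Averaging kernel: each gate also depends on the neighbouring channels.
_, mixed_gates = channel_attention(features, kernel=[1.0, 1.0, 1.0])
```

With the identity kernel, the channel with the largest mean activation receives the largest gate. With the three-tap kernel, the middle channel's gate rises because its neighbours are strongly activated, which is the local cross-channel interaction this family of attention modules relies on.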
Data availability
The corresponding author will supply the relevant data in response to reasonable requests.
Acknowledgements
This research is partially supported by the Key-Area Research and Development Program of Guangdong Province under Grant 2020B0909020001 and the National Natural Science Foundation of China under Grant No. 61573113.
Funding
This article is funded by the Key-Area Research and Development Program of Guangdong Province and the National Natural Science Foundation of China.
Ethics declarations
Conflict of interest
No conflict of interest.
About this article
Cite this article
Zhou, Y., Qian, H. & Ding, P. MSSD: multi-scale object detector based on spatial pyramid depthwise convolution and efficient channel attention mechanism. J Real-Time Image Proc 20, 103 (2023). https://doi.org/10.1007/s11554-023-01358-9
DOI: https://doi.org/10.1007/s11554-023-01358-9