YOLO-SA: An Efficient Object Detection Model Based on Self-attention Mechanism

Li, Ang; Song, Xiangyu; Sun, ShiJie; Zhang, Zhaoyang; Cai, Taotao; Song, Huansheng

doi:10.1007/978-981-97-2421-5_1

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14334)

Included in the following conference series:

Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data

Abstract

CNN-based object detectors are widely used for object detection, object classification, and related tasks. Traditional CNN modules usually adopt complex multi-branch designs, which reduce inference speed and memory utilization. Moreover, many works add an attention mechanism to the object detector to extract rich spatial features, but these mechanisms usually serve as additional modules alongside convolution and do not fundamentally overcome the limitations of the convolution operation. Finally, traditional object detectors often use coupled detection heads, which can compromise model performance. To address these problems, we propose a new object detection model, YOLO-SA, based on the popular object detector YOLOv5. We introduce the reparameterized module RepVGG to replace the original DarkNet53 structure of the YOLOv5 model, which greatly reduces model complexity and improves detection accuracy. We introduce a self-attention module in the feature fusion part of the model; it is independent of the other convolutional layers and outperforms other mainstream attention modules. We replace the coupled detection head of the YOLOv5 model with an anchor-based decoupled detection head, which greatly improves convergence speed during training. Experiments show that the detection accuracy of the proposed YOLO-SA model reaches 71.2% and 75.8% on the COCO2014 and VOC2012 datasets, respectively, which is superior to the YOLOv5s baseline and to other mainstream object detection models.
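
The reparameterization idea mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes a RepVGG-style block with a 3x3 branch, a 1x1 branch, and an identity branch, equal input and output channel counts, stride 1, and no batch normalization (a real block would first fold batch-norm statistics into each branch). All function names are hypothetical. The sketch checks that the three training-time branches collapse into a single 3x3 convolution with the same output.

```python
import random

def conv3x3(x, w, b):
    """Naive 3x3 convolution, stride 1, zero padding 1.
    x: [C_in][H][W], w: [C_out][C_in][3][3], b: [C_out]."""
    h, wd = len(x[0]), len(x[0][0])
    out = []
    for o in range(len(w)):
        plane = [[b[o]] * wd for _ in range(h)]
        for i in range(len(x)):
            for r in range(h):
                for c in range(wd):
                    for kr in range(3):
                        for kc in range(3):
                            rr, cc = r + kr - 1, c + kc - 1
                            if 0 <= rr < h and 0 <= cc < wd:
                                plane[r][c] += w[o][i][kr][kc] * x[i][rr][cc]
        out.append(plane)
    return out

def conv1x1(x, w, b):
    """Pointwise convolution. w: [C_out][C_in], b: [C_out]."""
    h, wd = len(x[0]), len(x[0][0])
    return [[[b[o] + sum(w[o][i] * x[i][r][c] for i in range(len(x)))
              for c in range(wd)] for r in range(h)] for o in range(len(w))]

def pad_1x1_to_3x3(w1):
    """Place each 1x1 weight at the centre of an otherwise zero 3x3 kernel."""
    return [[[[0.0, 0.0, 0.0], [0.0, w1[o][i], 0.0], [0.0, 0.0, 0.0]]
             for i in range(len(w1[o]))] for o in range(len(w1))]

def fuse_branches(w3, b3, w1, b1):
    """Merge the 3x3, 1x1 and identity branches into one 3x3 kernel and bias."""
    n = len(w3)
    # The identity branch is a 1x1 convolution whose weight is the identity matrix.
    identity = [[1.0 if o == i else 0.0 for i in range(n)] for o in range(n)]
    w1p, wid = pad_1x1_to_3x3(w1), pad_1x1_to_3x3(identity)
    fused_w = [[[[w3[o][i][r][c] + w1p[o][i][r][c] + wid[o][i][r][c]
                  for c in range(3)] for r in range(3)]
                for i in range(n)] for o in range(n)]
    fused_b = [b3[o] + b1[o] for o in range(n)]
    return fused_w, fused_b

def multi_branch(x, w3, b3, w1, b1):
    """Training-time form: sum of the 3x3 branch, the 1x1 branch and the input."""
    y3, y1 = conv3x3(x, w3, b3), conv1x1(x, w1, b1)
    return [[[y3[o][r][c] + y1[o][r][c] + x[o][r][c]
              for c in range(len(x[0][0]))] for r in range(len(x[0]))]
            for o in range(len(x))]

# Toy sizes and random weights, chosen only for this illustration.
random.seed(0)
C, H, W = 2, 4, 4
rnd = lambda: random.uniform(-1.0, 1.0)
x = [[[rnd() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w3 = [[[[rnd() for _ in range(3)] for _ in range(3)] for _ in range(C)] for _ in range(C)]
w1 = [[rnd() for _ in range(C)] for _ in range(C)]
b3, b1 = [rnd() for _ in range(C)], [rnd() for _ in range(C)]

fused_w, fused_b = fuse_branches(w3, b3, w1, b1)
y_train = multi_branch(x, w3, b3, w1, b1)   # three branches
y_infer = conv3x3(x, fused_w, fused_b)      # single fused convolution
max_diff = max(abs(y_train[o][r][c] - y_infer[o][r][c])
               for o in range(C) for r in range(H) for c in range(W))
print("max abs difference:", max_diff)
```

Because convolution is linear, the two forms agree up to floating-point rounding. At inference time only the fused kernel needs to be kept, which is the general motivation for a single-branch structure.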

Author information

Authors and Affiliations

Chang’an University, Xi’an, China
Ang Li, ShiJie Sun, Zhaoyang Zhang & Huansheng Song
Swinburne University of Technology, Melbourne, Australia
Xiangyu Song
Macquarie University, Sydney, Australia
Taotao Cai

Corresponding author

Correspondence to Zhaoyang Zhang.

Editor information

Editors and Affiliations

Peng Cheng Laboratory, Shenzhen, China
Xiangyu Song
China University of Geosciences, Wuhan, China
Ruyi Feng
China University of Geosciences, Wuhan, China
Yunliang Chen
Deakin University, Burwood, VIC, Australia
Jianxin Li
University of Exeter, Exeter, UK
Geyong Min

Cite this paper

Li, A., Song, X., Sun, S., Zhang, Z., Cai, T., Song, H. (2024). YOLO-SA: An Efficient Object Detection Model Based on Self-attention Mechanism. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14334. Springer, Singapore. https://doi.org/10.1007/978-981-97-2421-5_1

DOI: https://doi.org/10.1007/978-981-97-2421-5_1
Published: 12 May 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2420-8
Online ISBN: 978-981-97-2421-5
eBook Packages: Computer Science, Computer Science (R0)
