Abstract
In computer vision, object detection is a classical and challenging problem in which accurate results are hard to achieve. With the significant advancement of deep learning techniques over the past decades, many researchers have worked on enhancing object detection, segmentation and classification. Object detection performance is measured by both detection accuracy and inference time. Two-stage detectors generally achieve better detection accuracy than single-stage detectors. In 2015, the real-time object detection system YOLO was published, and it has iterated rapidly since, with the newest release, YOLOv8, arriving in January 2023. YOLO achieves high detection accuracy and fast inference with a single-stage detector. Many applications readily adopt YOLO versions because of their high inference speed. This paper presents a complete survey of YOLO versions up to YOLOv8. It begins by explaining the performance metrics used in object detection, post-processing methods, dataset availability and the most widely used object detection techniques; it then discusses the architectural design of each YOLO version. Finally, the diverse range of YOLO versions is discussed by highlighting their contributions to various applications.
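As an illustration of the kind of performance metric and post-processing step the survey covers, the following minimal sketch computes intersection over union (IoU) between two boxes and applies greedy non-maximum suppression (NMS). It is not taken from any YOLO release: the names `Box`, `iou` and `nms`, the corner-coordinate box format and the 0.5 threshold are assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    demo_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    demo_scores = [0.9, 0.8, 0.7]
    print(nms(demo_boxes, demo_scores))
```

In this sketch the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box is kept. Production detectors typically run a vectorised, per-class variant of the same idea.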
Acknowledgements
The authors are grateful to the Ministry of Heavy Industries (MHI), Government of India, for their funding support under the Scheme for Enhancement of Competitiveness in the Indian Capital Goods Sector Phase II. The authors express their gratitude to SASTRA Deemed University, Thanjavur, for providing the infrastructural facilities to carry out this research work.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
About this article
Cite this article
Vijayakumar, A., Vairavasundaram, S. YOLO-based Object Detection Models: A Review and its Applications. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18872-y
DOI: https://doi.org/10.1007/s11042-024-18872-y