Abstract
Object detection is an important field in computer vision. Detecting objects in aerial images is particularly challenging for three reasons. The objects can be very small relative to the image. They can appear at any orientation. The same object can also appear at different sizes depending on the altitude. YOLOv5 is a recent object detection algorithm that offers a good balance between accuracy and speed. This work enhances YOLOv5 specifically for small target detection. Accuracy on small objects is improved in two ways: by adding a new feature fusion layer to the feature pyramid of YOLOv5, and by using compound scaling to increase the input size. On the small-vehicle class of the DOTA dataset, the modified YOLOv5 improves mAP@0.5 by 11%. It also requires 25% fewer GFLOPs and runs inference 10.52% faster, which makes it well suited to real-time applications. On the challenging VisDrone dataset, the modified YOLOv5 achieves 45.2% mAP@0.5, compared with 31.7% for the baseline YOLOv5. The modified YOLOv5 outperforms many state-of-the-art algorithms for small target detection in aerial images. In addition to the performance evaluation, we present an analysis of object sizes, measured as pixel areas, in the VisDrone and DOTA datasets. The proposed modifications show the potential for significant advances in small target detection in aerial images and provide insights for further research in this area.
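The object-size analysis mentioned above can be sketched roughly as follows. This is a minimal illustration and not the authors' script. It rests on three assumptions. First, labels are in the YOLO text format (`class cx cy w h`, with `w` and `h` normalized to [0, 1]). Second, boxes are axis-aligned, even though DOTA natively uses oriented boxes. Third, the COCO-style thresholds of 32² and 96² pixels separate small, medium, and large objects. The paper's own bins may differ. The function names and the example labels are hypothetical.

```python
from collections import Counter
from typing import Iterable

# COCO-style area thresholds in pixels (an assumption; the paper's bins may differ)
SMALL_MAX = 32 * 32   # areas below 1024 px^2 count as "small"
MEDIUM_MAX = 96 * 96  # areas below 9216 px^2 count as "medium"


def box_area_px(label_line: str, img_w: int, img_h: int) -> float:
    """Pixel area of one YOLO-format label line 'cls cx cy w h' (w, h normalized)."""
    _cls, _cx, _cy, w, h = label_line.split()
    return float(w) * img_w * float(h) * img_h


def size_category(area: float) -> str:
    """Map a pixel area to a COCO-style size bucket."""
    if area < SMALL_MAX:
        return "small"
    if area < MEDIUM_MAX:
        return "medium"
    return "large"


def size_histogram(label_lines: Iterable[str], img_w: int, img_h: int) -> Counter:
    """Count the boxes of one image per size bucket, skipping blank lines."""
    counts: Counter = Counter()
    for line in label_lines:
        if line.strip():
            counts[size_category(box_area_px(line, img_w, img_h))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical labels for a hypothetical 1360x765 aerial image
    labels = [
        "0 0.50 0.50 0.010 0.020",  # about 13.6 x 15.3 px, so small
        "3 0.20 0.30 0.050 0.080",  # about 68 x 61.2 px, so medium
        "1 0.70 0.60 0.200 0.300",  # about 272 x 229.5 px, so large
    ]
    print(size_histogram(labels, 1360, 765))
```

To build a dataset-wide histogram, sum the per-image `Counter` objects. Because the thresholds are absolute pixel areas, the same object can change bucket when the input is rescaled. This is one reason a larger input size can help small targets.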
Data Availability Statement
The data required for reproducing the results in this work is available at https://github.com/inderpreet1390/YOLOv5-small-target.
Ethics declarations
Conflict of Interest Statement
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A Nomenclature table
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Singh, I., Munjal, G. Modified YOLOv5 for small target detection in aerial images. Multimed Tools Appl 83, 53221–53242 (2024). https://doi.org/10.1007/s11042-023-17625-7
DOI: https://doi.org/10.1007/s11042-023-17625-7