
MGA-YOLOv4: a multi-scale pedestrian detection method based on mask-guided attention

Published in Applied Intelligence.

Abstract

The many deep convolutions in YOLOv4 generate redundant background features, preventing the network from focusing on pedestrians at a specific scale. To address this, we propose MGA-YOLOv4 (Mask-Guided Attention YOLOv4), which dynamically selects the most discriminative features from a cluttered background. First, we design a semantic segmentation encoder-decoder network that generates a fine-grained pixel-level mask, which serves as a weakly supervised signal in each detection branch. Second, we build a mask-guided attention module that produces attention weights along the channel and spatial dimensions and encodes them into the mask, highlighting pedestrians of a specific scale while suppressing background interference. To demonstrate the effectiveness of MGA, we visualize the network's attention maps and conduct ablation experiments. The results show that the miss rate of the proposed method with channel-space concatenation is 1.82% lower than that of the original YOLOv4. Comparison experiments on five challenging pedestrian detection datasets show that our method achieves highly competitive performance against state-of-the-art methods and reaches a favourable trade-off between speed and accuracy.
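To make the mask-guided attention idea concrete, the following is a minimal NumPy sketch, not the paper's implementation: it combines a channel attention gate (global average pooling per channel), a spatial attention gate (pooling across channels), and a soft pedestrian mask that zeroes out background locations. The specific pooling and sigmoid-gating choices are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_guided_attention(features, mask):
    """Re-weight a feature map with channel and spatial attention,
    modulated by a pixel-level pedestrian mask.

    features: (C, H, W) feature map from one detection branch
    mask:     (H, W) soft segmentation mask in [0, 1]
    """
    # Channel attention: squeeze each channel to a scalar, then gate.
    channel_weights = sigmoid(features.mean(axis=(1, 2)))   # (C,)

    # Spatial attention: pool across channels, then gate per location.
    spatial_weights = sigmoid(features.mean(axis=0))        # (H, W)

    # Encode the mask into the spatial attention: pedestrian pixels
    # are kept, background locations are suppressed to zero.
    guided = spatial_weights * mask                         # (H, W)

    return features * channel_weights[:, None, None] * guided[None, :, :]

rng = np.random.default_rng(0)
feats = rng.random((8, 4, 4))          # toy 8-channel feature map
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                   # pedestrian occupies the centre
out = mask_guided_attention(feats, mask)
print(out.shape)                       # (8, 4, 4)
```

In this sketch the background responses are fully zeroed by the hard mask; with a soft (probabilistic) mask they would instead be down-weighted, which is closer to the weakly supervised signal the abstract describes.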



Author information

Corresponding author: Liang Wan.


Cite this article

Wang, T., Wan, L., Tang, L. et al. MGA-YOLOv4: a multi-scale pedestrian detection method based on mask-guided attention. Appl Intell 52, 15308–15324 (2022). https://doi.org/10.1007/s10489-021-03061-3

