Abstract
The numerous deep convolutions in YOLOv4 generate many redundant background features, preventing the network from focusing on pedestrians at a specific scale. To address this, we propose MGA-YOLOv4 (Mask-Guided Attention YOLOv4), a method that can dynamically select the most crucial features from a cluttered background. First, we design a semantic segmentation encoder-decoder network to generate a fine-grained pixel-level mask that serves as a weakly supervised signal in each detection branch. Second, we build a mask-guided attention module that produces attention weights along the channel and spatial dimensions and encodes them into the mask to highlight pedestrians of a specific scale while avoiding background interference. To demonstrate the effectiveness of MGA, we visualise the network's attention maps and conduct ablation experiments. The results show that the proposed method, combined with the channel-concatenated-spatial attention variant, reduces the miss rate by 1.82% compared with the original YOLOv4. Comparison experiments on five challenging pedestrian detection datasets show that our method achieves very competitive performance against state-of-the-art methods and reaches a favourable trade-off between speed and accuracy.
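The core idea — gating a feature map with channel and spatial attention weights and then modulating it by a pixel-level pedestrian mask — can be illustrated with a minimal NumPy sketch. This is not the authors' exact module: the learned convolutional gates are replaced here by simple average pooling for clarity, and the function name and residual `1 + mask` weighting are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_guided_attention(features, mask):
    """Reweight a feature map (C, H, W) with channel and spatial
    attention, then modulate it by a pixel-level mask (H, W)."""
    # Channel attention: squeeze the spatial dims and gate each
    # channel (in the spirit of squeeze-and-excitation).
    channel_weights = sigmoid(features.mean(axis=(1, 2)))   # (C,)
    out = features * channel_weights[:, None, None]
    # Spatial attention: pool across channels and gate each
    # location (in the spirit of CBAM's spatial branch).
    spatial_weights = sigmoid(out.mean(axis=0))             # (H, W)
    out = out * spatial_weights[None, :, :]
    # Mask guidance: amplify pedestrian pixels, leave background
    # unboosted (residual weighting so the mask never zeroes features).
    out = out * (1.0 + mask)[None, :, :]
    return out

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))       # toy feature map
mask = np.zeros((8, 8))
mask[2:6, 3:5] = 1.0                        # toy pedestrian mask
out = mask_guided_attention(feat, mask)
print(out.shape)
```

In the actual detector the mask comes from the segmentation branch and the gates are learned, but the data flow — channel gate, spatial gate, mask modulation — follows this pattern.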
Cite this article
Wang, T., Wan, L., Tang, L. et al. MGA-YOLOv4: a multi-scale pedestrian detection method based on mask-guided attention. Appl Intell 52, 15308–15324 (2022). https://doi.org/10.1007/s10489-021-03061-3