A novel backbone architecture for pedestrian detection based on the human visual system

  • Original article
  • Journal: The Visual Computer

Abstract

Pedestrian detection using deep convolutional neural networks (DCNNs) has made a breakthrough in the last few years, and researchers have proposed different DCNN architectures to detect pedestrians more accurately. Most of these architectures use a backbone taken from earlier state-of-the-art classification architectures and merely adapt it to the detection task. Their performance is then improved with heuristics, trial and error, and sometimes a grid search over a space of candidate architectures. However, no prior work has first studied the human visual detection system and then proposed a backbone architecture based on it. In this paper, we first review the state-of-the-art methods, then give preliminaries on the visual detection system in the human brain, and finally propose our architecture based on it. The intuition behind our idea can explain the evolution of detection architectures from the first fully convolutional neural networks (FCNNs), such as Faster R-CNN, to the current state-of-the-art methods, and gives a better understanding of why some architectures are superior to others. An advantage of our idea is that it can be applied to most existing architectures with some modifications, although this is much easier for some methods than for others. We implemented our idea on top of an anchor-free method called CSP and achieved better performance on Caltech-USA and INRIA, two of the most popular pedestrian detection datasets.

Availability of data and materials

The image datasets used to support the findings of this study can be downloaded from the public websites cited in the article.

Acknowledgements

The authors would like to acknowledge the Iran Telecommunication Research Center for its support throughout this research.

Author information

Contributions

All authors took part in the discussion of the work described in this paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mahmoud Saeidi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Saeidi, M., Arabsorkhi, A. A novel backbone architecture for pedestrian detection based on the human visual system. Vis Comput 38, 2223–2237 (2022). https://doi.org/10.1007/s00371-021-02280-6

  • DOI: https://doi.org/10.1007/s00371-021-02280-6
