DRENet: Giving Full Scope to Detection and Regression-Based Estimation for Video Crowd Counting

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2021 (ICANN 2021)

Abstract

Existing deep learning-based video crowd counting methods mainly leverage temporal correlation to improve the model. Although they achieve comparable results, most of these methods disregard the fact that crowd density varies enormously across the spatial and temporal domains of a video, which hinders further improvement in video crowd counting performance. To overcome this issue, a new detection and regression estimation network, named DRENet, is proposed. It first estimates the crowd density by separately generating two density maps: a video object detection-based map and a regression-based map built on mixed 3D-2D convolutions. The detection-based method functions well in sparse scenes and the regression-based method in congested scenes. Moreover, a multi-column attention-based fusion block is proposed to perceive the crowd density in a frame and to adaptively allocate the relative weights of the video detection- and regression-based estimations. The optimal crowd counts are then obtained with guidance from the attention block. The experimental results demonstrate that our method achieves state-of-the-art performance on three public video crowd counting datasets.
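
The abstract does not give the fusion formula, so the following is only a minimal sketch of one common way such attention-guided weighting can be realised: a per-pixel convex combination of the two density maps, followed by summation to obtain the count. The function names (`fuse_density_maps`, `crowd_count`), the toy values, and the assumption that attention weights lie in [0, 1] are all hypothetical and are not taken from the paper.

```python
from typing import List

# A density map is represented here as a plain 2D list of floats (assumption).
Map = List[List[float]]


def fuse_density_maps(det: Map, reg: Map, attention: Map) -> Map:
    """Blend two density maps pixel by pixel.

    attention[i][j] is assumed to lie in [0, 1]; values near 1 favour the
    detection-based map (sparse regions), values near 0 favour the
    regression-based map (congested regions).
    """
    fused = []
    for det_row, reg_row, att_row in zip(det, reg, attention):
        fused.append(
            [a * d + (1.0 - a) * r for d, r, a in zip(det_row, reg_row, att_row)]
        )
    return fused


def crowd_count(density: Map) -> float:
    """The estimated count is the sum over all pixels of the density map."""
    return sum(sum(row) for row in density)


# Toy 2x2 frame: the top row is sparse, the bottom row is congested.
det = [[1.0, 0.0], [0.0, 0.0]]        # hypothetical detection-based map
reg = [[0.5, 0.5], [2.0, 2.0]]        # hypothetical regression-based map
attention = [[1.0, 1.0], [0.0, 0.0]]  # trust detection on top, regression below

fused = fuse_density_maps(det, reg, attention)
print(fused)               # [[1.0, 0.0], [2.0, 2.0]]
print(crowd_count(fused))  # 5.0
```

In this toy frame the fused map takes the detection values in the sparse top row and the regression values in the congested bottom row, which is the behaviour the abstract attributes to the attention block. In DRENet the weights are produced by the network rather than set by hand.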

Author information

Corresponding author

Correspondence to Yadong Mu.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, C., Huang, Y., Mu, Y., Yu, X. (2021). DRENet: Giving Full Scope to Detection and Regression-Based Estimation for Video Crowd Counting. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2021. Lecture Notes in Computer Science, vol 12892. Springer, Cham. https://doi.org/10.1007/978-3-030-86340-1_2

  • DOI: https://doi.org/10.1007/978-3-030-86340-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86339-5

  • Online ISBN: 978-3-030-86340-1

  • eBook Packages: Computer Science, Computer Science (R0)
