
Wide-Area Crowd Counting: Multi-view Fusion Networks for Counting in Large Scenes


Abstract

Crowd counting in single-view images has achieved outstanding performance on existing counting datasets. However, single-view counting is not applicable to large and wide scenes (e.g., public parks, long subway platforms, or event spaces) because a single camera cannot capture the whole scene in adequate detail for counting: the scene may be too large to fit into the camera's field-of-view, so long that faraway crowds appear at too low a resolution, or contain large objects that occlude large portions of the crowd. Solving the wide-area counting task therefore requires multiple cameras with overlapping fields-of-view. In this paper, we propose a deep neural network framework for multi-view crowd counting, which fuses information from multiple camera views to predict a scene-level density map on the ground-plane of the 3D world. We consider three versions of the fusion framework: the late fusion model fuses camera-view density maps; the naïve early fusion model fuses camera-view feature maps; and the multi-view multi-scale early fusion model ensures that features aligned to the same ground-plane point have consistent scales. A rotation selection module further ensures consistent rotation alignment of the features. We test our three fusion models on three multi-view counting datasets: PETS2009, DukeMTMC, and a newly collected multi-view counting dataset of a crowded street intersection. Our methods achieve state-of-the-art results compared with other multi-view counting baselines.
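To make the fusion pipeline concrete, below is a minimal sketch of the late-fusion variant, assuming PyTorch; the module name, layer sizes, and the precomputed projection grids are illustrative assumptions, not the authors' released implementation. Each camera view is passed through a shared single-view density estimator, the per-view density maps are warped onto the common ground plane using calibration-derived sampling grids, and a small fusion CNN predicts the scene-level ground-plane density map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LateFusionCounter(nn.Module):
    """Sketch of a late-fusion multi-view counter: per-view density maps are
    projected onto a common ground plane and fused by a small CNN.
    Illustrative only; layer sizes and names are assumptions, not the paper's code."""

    def __init__(self, num_views):
        super().__init__()
        # single-view density estimator, shared across all cameras
        self.density_net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.ReLU(),
        )
        # fusion CNN over the stacked, ground-plane-aligned view densities
        self.fusion_net = nn.Sequential(
            nn.Conv2d(num_views, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.ReLU(),
        )

    def forward(self, views, grids):
        """views: list of (B, 3, H, W) camera images, one per view.
        grids: list of (B, H_gp, W_gp, 2) sampling grids implementing the
        image-to-ground-plane projection for each camera (precomputed from
        the camera calibration)."""
        warped = []
        for img, grid in zip(views, grids):
            dmap = self.density_net(img)                       # (B, 1, H, W)
            # project the camera-view density map onto the ground plane
            gp_dmap = F.grid_sample(dmap, grid, align_corners=False)
            warped.append(gp_dmap)
        fused = torch.cat(warped, dim=1)                       # (B, V, H_gp, W_gp)
        scene_density = self.fusion_net(fused)                 # (B, 1, H_gp, W_gp)
        return scene_density


# usage: the scene count is the sum over the predicted ground-plane density map
# model = LateFusionCounter(num_views=3)
# density = model(view_images, projection_grids)
# count = density.sum(dim=(1, 2, 3))
```

The early-fusion variants follow the same pattern, except that feature maps (rather than density maps) are projected onto the ground plane before fusion, with the multi-scale version additionally selecting per-location feature scales so that features aligned to the same ground-plane point have consistent scales.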





Acknowledgements

This work was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. [T32-101/15-R], CityU 11212518, CityU SRG 7005665 and UGC GRF CityU 11215820), and by a Strategic Research Grant from City University of Hong Kong (Project No. 7004887). We are grateful for the support of NVIDIA Corporation with the donation of the Tesla GPU used for this research.

Author information


Corresponding author

Correspondence to Qi Zhang.

Additional information

Communicated by Ming-Hsuan Yang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhang, Q., Chan, A.B. Wide-Area Crowd Counting: Multi-view Fusion Networks for Counting in Large Scenes. Int J Comput Vis 130, 1938–1960 (2022). https://doi.org/10.1007/s11263-022-01626-4

