
CCST: crowd counting with swin transformer


Abstract

The goal of crowd counting is to accurately estimate the number of individuals in an image. The task has long faced two major difficulties: uneven distribution of crowd density and a large span of head sizes. For the former, most CNN-based methods divide the image into multiple patches for processing, ignoring the connections between patches. For the latter, multi-scale feature fusion methods based on feature pyramids ignore the matching relationship between head size and the hierarchical features. In response to these issues, we propose a crowd counting network named CCST based on the Swin Transformer, and tailor a feature adaptive fusion regression head called FAFHead. The Swin Transformer fully exchanges information within and between patches, effectively alleviating the problem of unevenly distributed crowd density. FAFHead adaptively fuses multi-level features, improving the match between head size and feature pyramid level and relieving the problem of the large span of head sizes. Experimental results on common datasets show that CCST achieves better counting performance than all weakly supervised counting works and the great majority of popular density map-based fully supervised works.
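To make the idea of an adaptive multi-level fusion regression head concrete, the listing below is a minimal PyTorch sketch: features from several backbone stages are projected to a common width, fused with learned per-pixel weights, and regressed to a single count per image. It is an illustrative assumption only; the class name AdaptiveFusionHead, the channel widths, and the per-pixel softmax fusion rule are ours and do not reproduce the paper's FAFHead.

# Hedged sketch of an adaptive multi-level fusion regression head.
# Names, channel sizes, and the fusion rule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveFusionHead(nn.Module):
    """Fuses multi-level backbone features with learned per-pixel weights,
    then regresses a scalar crowd count per image."""

    def __init__(self, in_channels=(192, 384, 768), mid_channels=256):
        super().__init__()
        # Project each pyramid level to a common channel width (assumption).
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, mid_channels, kernel_size=1) for c in in_channels]
        )
        # One weight map per level; softmax across levels gives fusion weights.
        self.weight = nn.ModuleList(
            [nn.Conv2d(mid_channels, 1, kernel_size=1) for _ in in_channels]
        )
        self.regressor = nn.Sequential(
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(mid_channels, 1),
        )

    def forward(self, features):
        # features: list of tensors [B, C_i, H_i, W_i] from backbone stages,
        # ordered from the highest-resolution stage to the lowest.
        target_size = features[0].shape[-2:]
        projected = [
            F.interpolate(p(f), size=target_size, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, features)
        ]
        # Per-pixel, per-level fusion weights (softmax over the level dimension).
        logits = torch.stack([w(p) for w, p in zip(self.weight, projected)], dim=0)
        weights = torch.softmax(logits, dim=0)
        fused = sum(w * p for w, p in zip(weights, projected))
        return self.regressor(fused)  # [B, 1] predicted count per image


if __name__ == "__main__":
    # Dummy pyramid features standing in for hierarchical transformer stage outputs.
    feats = [torch.randn(2, c, s, s) for c, s in zip((192, 384, 768), (48, 24, 12))]
    print(AdaptiveFusionHead()(feats).shape)  # torch.Size([2, 1])

The per-pixel softmax over levels lets each spatial location draw mostly on the pyramid level whose scale best matches the local head size, which is the matching problem the abstract emphasizes.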



Acknowledgements

The research project is partially supported by the National Key R&D Program of China (No. 2021ZD0111902), the National Natural Science Foundation of China (Nos. 62072015, U21B2038, U1811463, U19B2039), and the Beijing Natural Science Foundation (No. 4222021).

Author information


Corresponding author

Correspondence to Yong Zhang.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, B., Zhang, Y., Xu, H. et al. CCST: crowd counting with swin transformer. Vis Comput 39, 2671–2682 (2023). https://doi.org/10.1007/s00371-022-02485-3

