
CCST: crowd counting with swin transformer


Abstract

The goal of crowd counting is to accurately estimate the number of individuals in an image. The task has long faced two major difficulties: uneven distribution of crowd density and a large span of head sizes. For the former, most CNN-based methods divide the image into multiple patches for processing, ignoring the connections between patches. For the latter, multi-scale feature fusion methods based on feature pyramids ignore the matching relationship between head size and the hierarchical features. In response to these issues, we propose a crowd counting network named CCST based on the Swin Transformer, and tailor a feature adaptive fusion regression head called FAFHead. The Swin Transformer fully exchanges information within and between patches, effectively alleviating the problem of unevenly distributed crowd density. FAFHead adaptively fuses multi-level features, improving the match between head size and feature pyramid level and relieving the problem of the large span of head sizes. Experimental results on common datasets show that CCST achieves better counting performance than all weakly supervised counting works and the great majority of popular density map-based fully supervised works.
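To make the idea of an adaptive multi-level fusion regression head concrete, the listing below is a minimal PyTorch sketch: features from several backbone stages are projected to a common width, fused with learned per-pixel weights, and regressed to a single count per image. It is an illustrative assumption only; the class name AdaptiveFusionHead, the channel widths, and the per-pixel softmax fusion rule are ours and do not reproduce the paper's FAFHead.

# Hedged sketch of an adaptive multi-level fusion regression head.
# Names, channel sizes, and the fusion rule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveFusionHead(nn.Module):
    """Fuses multi-level backbone features with learned per-pixel weights,
    then regresses a scalar crowd count per image."""

    def __init__(self, in_channels=(192, 384, 768), mid_channels=256):
        super().__init__()
        # Project each pyramid level to a common channel width (assumption).
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, mid_channels, kernel_size=1) for c in in_channels]
        )
        # One weight map per level; softmax across levels gives fusion weights.
        self.weight = nn.ModuleList(
            [nn.Conv2d(mid_channels, 1, kernel_size=1) for _ in in_channels]
        )
        self.regressor = nn.Sequential(
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(mid_channels, 1),
        )

    def forward(self, features):
        # features: list of tensors [B, C_i, H_i, W_i] from backbone stages,
        # ordered from the highest-resolution stage to the lowest.
        target_size = features[0].shape[-2:]
        projected = [
            F.interpolate(p(f), size=target_size, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, features)
        ]
        # Per-pixel, per-level fusion weights (softmax over the level dimension).
        logits = torch.stack([w(p) for w, p in zip(self.weight, projected)], dim=0)
        weights = torch.softmax(logits, dim=0)
        fused = sum(w * p for w, p in zip(weights, projected))
        return self.regressor(fused)  # [B, 1] predicted count per image


if __name__ == "__main__":
    # Dummy pyramid features standing in for hierarchical transformer stage outputs.
    feats = [torch.randn(2, c, s, s) for c, s in zip((192, 384, 768), (48, 24, 12))]
    print(AdaptiveFusionHead()(feats).shape)  # torch.Size([2, 1])

The per-pixel softmax over levels lets each spatial location draw mostly on the pyramid level whose scale best matches the local head size, which is the matching problem the abstract emphasizes.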



Acknowledgements

The research project is partially supported by the National Key R&D Program of China (No. 2021ZD0111902), the National Natural Science Foundation of China (Nos. 62072015, U21B2038, U1811463, U19B2039), and the Beijing Natural Science Foundation (No. 4222021).

Author information


Corresponding author

Correspondence to Yong Zhang.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, B., Zhang, Y., Xu, H. et al. CCST: crowd counting with swin transformer. Vis Comput 39, 2671–2682 (2023). https://doi.org/10.1007/s00371-022-02485-3

