TransCrowd: weakly-supervised crowd counting with transformers

  • Research Paper
  • Special Focus on Deep Learning for Computer Vision
Science China Information Sciences

Abstract

Mainstream crowd counting methods usually use a convolutional neural network (CNN) to regress a density map, which requires point-level annotations. However, annotating each person with a point is expensive and laborious. Moreover, point-level annotations are not used to evaluate counting accuracy during testing, which means they are redundant. It is therefore desirable to develop weakly-supervised counting methods that rely only on count-level annotations, a more economical form of labeling. Current weakly-supervised counting methods use a CNN to regress the total crowd count in an image-to-count paradigm. However, the limited receptive field available for context modeling is an intrinsic limitation of these weakly-supervised CNN-based methods, so they cannot achieve satisfactory performance and have limited real-world applications. The transformer is a popular sequence-to-sequence prediction model in natural language processing (NLP) and has a global receptive field. In this paper, we propose TransCrowd, which reformulates weakly-supervised crowd counting from a sequence-to-count perspective based on transformers. We observe that TransCrowd can effectively extract semantic crowd information through the transformer's self-attention mechanism. To the best of our knowledge, this is the first work to adopt a pure transformer for crowd counting research. Experiments on five benchmark datasets demonstrate that TransCrowd outperforms all the weakly-supervised CNN-based counting methods and achieves highly competitive counting performance compared with some popular fully-supervised counting methods.
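The sequence-to-count idea can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the patch size, the single attention layer with identity projections, the mean pooling, the linear regression head, and the L1 loss are all assumptions chosen for brevity, and every function name below is hypothetical. The sketch only shows how an image becomes a sequence of patch tokens, how self-attention lets every token see all the others (a global receptive field), and how a single count-level label can supervise the output.

```python
import math


def image_to_patch_sequence(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch tokens."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tokens.append([image[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens


def self_attention(tokens):
    """Single-head scaled dot-product attention (assumption: identity projections).

    Each output token is a softmax-weighted mix of ALL input tokens, which is
    what gives the model global context regardless of spatial distance.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * value[c] for w, value in zip(weights, tokens))
                        for c in range(dim)])
    return outputs


def predict_count(image, patch, head_weights, head_bias):
    """Sequence-to-count: patch tokens -> attention -> mean pooling -> scalar."""
    attended = self_attention(image_to_patch_sequence(image, patch))
    dim = len(attended[0])
    pooled = [sum(tok[c] for tok in attended) / len(attended) for c in range(dim)]
    return sum(w * x for w, x in zip(head_weights, pooled)) + head_bias


def count_l1_loss(predicted, annotated_count):
    """Count-level supervision (assumed L1): only the image's total is needed."""
    return abs(predicted - annotated_count)


# Toy usage: an 8 x 8 image of ones, 4 x 4 patches, hand-picked head weights.
toy_image = [[1.0] * 8 for _ in range(8)]
toy_prediction = predict_count(toy_image, patch=4,
                               head_weights=[0.1] * 16, head_bias=0.0)
toy_loss = count_l1_loss(toy_prediction, annotated_count=3)
```

Note that the training signal here is one number per image; no point-level annotation enters the loss.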

Acknowledgements

This work was supported by the National Key R&D Program of China (Grant No. 2018YFB1004600).

Author information

Corresponding author

Correspondence to Xiang Bai.

About this article

Cite this article

Liang, D., Chen, X., Xu, W. et al. TransCrowd: weakly-supervised crowd counting with transformers. Sci. China Inf. Sci. 65, 160104 (2022). https://doi.org/10.1007/s11432-021-3445-y

  • DOI: https://doi.org/10.1007/s11432-021-3445-y
