Abstract
Mainstream crowd counting methods typically use a convolutional neural network (CNN) to regress a density map, which requires point-level annotations. However, annotating each person with a point is expensive and laborious. Moreover, point-level annotations are not used to evaluate counting accuracy during the testing phase, which means they are redundant. It is therefore desirable to develop weakly-supervised counting methods that rely only on count-level annotations, a more economical form of labeling. Current weakly-supervised counting methods use a CNN to regress the total count of a crowd under an image-to-count paradigm. However, these CNN-based methods are intrinsically limited by their restricted receptive fields for context modeling. As a result, they cannot achieve satisfactory performance, which limits their real-world applications. The transformer, a popular sequence-to-sequence prediction model in natural language processing (NLP), has a global receptive field. In this paper, we propose TransCrowd, which reformulates weakly-supervised crowd counting from a sequence-to-count perspective based on transformers. We observe that TransCrowd can effectively extract semantic crowd information through the self-attention mechanism of the transformer. To the best of our knowledge, this is the first work to adopt a pure transformer for crowd counting. Experiments on five benchmark datasets demonstrate that TransCrowd outperforms all weakly-supervised CNN-based counting methods and achieves highly competitive counting performance compared with some popular fully-supervised counting methods.
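To make the sequence-to-count idea concrete, the following is a minimal sketch in plain Python and not the TransCrowd implementation. It assumes a small grayscale image, a single self-attention layer with random, untrained weights, mean pooling over tokens, and an L1 loss against the count label; the patch size, embedding width, and all function names are hypothetical choices made for illustration. It shows only the data flow: the image becomes a sequence of patch tokens, self-attention lets every token attend to every other token (a global receptive field), and a linear head regresses one scalar count, so training needs only a count-level label.

```python
import math
import random


def image_to_patches(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch tokens."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tokens.append([image[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens


def matvec(weights, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the whole sequence."""
    queries = [matvec(wq, t) for t in tokens]
    keys = [matvec(wk, t) for t in tokens]
    values = [matvec(wv, t) for t in tokens]
    scale = math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        # Each token scores itself against every token in the image.
        attn = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        weights.append(attn)
        outputs.append([sum(a * v[d] for a, v in zip(attn, values))
                        for d in range(len(values[0]))])
    return outputs, weights


def predict_count(image, patch, params):
    """Sequence-to-count: patches -> self-attention -> mean pool -> linear head."""
    tokens = image_to_patches(image, patch)
    attended, weights = self_attention(tokens, params["wq"], params["wk"], params["wv"])
    dim = len(attended[0])
    pooled = [sum(t[d] for t in attended) / len(attended) for d in range(dim)]
    count = sum(w * x for w, x in zip(params["head"], pooled)) + params["bias"]
    return count, weights


def l1_loss(predicted, target):
    """Count-level supervision (assumed loss): one number per image, no point labels."""
    return abs(predicted - target)


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: a 16 x 16 image, 4 x 4 patches, 8-dimensional tokens.
rng = random.Random(0)
PATCH, DIM = 4, 8
params = {
    "wq": random_matrix(rng, DIM, PATCH * PATCH),
    "wk": random_matrix(rng, DIM, PATCH * PATCH),
    "wv": random_matrix(rng, DIM, PATCH * PATCH),
    "head": [rng.uniform(-0.1, 0.1) for _ in range(DIM)],
    "bias": 0.0,
}
image = [[rng.random() for _ in range(16)] for _ in range(16)]
count, attn = predict_count(image, PATCH, params)
loss = l1_loss(count, 3.0)  # 3.0 stands in for a made-up count-level label
```

Because the weights are untrained, the predicted count is meaningless here. The point is that each row of `attn` spans all 16 patches, whereas a single convolution layer would see only a local neighborhood.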
Acknowledgements
This work was supported by the National Key R&D Program of China (Grant No. 2018YFB1004600).
Cite this article
Liang, D., Chen, X., Xu, W. et al. TransCrowd: weakly-supervised crowd counting with transformers. Sci. China Inf. Sci. 65, 160104 (2022). https://doi.org/10.1007/s11432-021-3445-y
DOI: https://doi.org/10.1007/s11432-021-3445-y