DTCC: Multi-level dilated convolution with transformer for weakly-supervised crowd counting

Miao, Zhuangzhuang; Zhang, Yong; Peng, Yuan; Peng, Haocheng; Yin, Baocai

doi:10.1007/s41095-022-0313-5

Research Article
Open access
Published: 02 April 2023

Volume 9, pages 859–873 (2023)
Computational Visual Media

Abstract

Crowd counting provides an important foundation for public security and urban management. It is a challenging task because crowd images contain small targets and large density variations. Mainstream methods usually apply convolutional neural networks (CNNs) to regress a density map, which requires annotations of individual persons as well as counts. Weakly-supervised methods avoid detailed labeling and require only image-level counts as annotations, but existing methods fail to achieve satisfactory performance because they usually ignore the global receptive field and multi-level information. We propose a weakly-supervised method, DTCC, which combines multi-level dilated convolution with a transformer to realize end-to-end crowd counting. Its main components are a recursive Swin transformer and a multi-level dilated convolution regression head. The recursive Swin transformer combines a pyramid visual transformer with a fine-tuned recursive pyramid structure to capture deep multi-level crowd features, including global features. The multi-level dilated convolution regression head consists of multi-level dilated convolution, which serves as the feature extraction module, and a linear regression head. The feature extraction module captures low- and high-level features simultaneously and enlarges the receptive field. In addition, two regression head fusion mechanisms realize dynamic and mean fusion counting. Experiments on four well-known crowd counting benchmarks (UCF_CC_50, ShanghaiTech, UCF_QNRF, and JHU-Crowd++) show that DTCC achieves results superior to other weakly-supervised methods and comparable to fully-supervised methods.
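To make two ideas from the abstract concrete, the following minimal Python sketch shows how dilation enlarges the span of a convolution kernel and how the counts predicted by two regression heads could be fused. It is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, `weighted_fusion` is only a stand-in for the dynamic fusion that the abstract names without specifying, and the count-level L1 loss is a common choice for count-only supervision that the abstract does not confirm.

```python
from typing import List, Sequence


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """'Valid' 1D dilated convolution (cross-correlation, no padding)."""
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are spaced `dilation` samples apart.
        out.append(sum(w * signal[start + i * dilation]
                       for i, w in enumerate(kernel)))
    return out


def mean_fusion(counts: Sequence[float]) -> float:
    """Average the counts predicted by several regression heads."""
    return sum(counts) / len(counts)


def weighted_fusion(counts: Sequence[float], weights: Sequence[float]) -> float:
    """Hypothetical stand-in for dynamic fusion: a normalized weighted sum.

    In a trained model the weights would be learned or predicted per image;
    here they are passed in by hand (an assumption for illustration).
    """
    total = sum(weights)
    return sum(c * w for c, w in zip(counts, weights)) / total


def count_l1_loss(predicted: float, ground_truth: float) -> float:
    """Count-level L1 loss: the only label used is the image's total count."""
    return abs(predicted - ground_truth)


if __name__ == "__main__":
    for d in (1, 2, 3):
        print("dilation", d, "-> span", effective_kernel_size(3, d))
    print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1, 1], dilation=2))
    print(mean_fusion([100.0, 110.0]), weighted_fusion([100.0, 110.0], [1.0, 3.0]))
```

With a 3-tap kernel, dilations of 1, 2, and 3 cover spans of 3, 5, and 7 samples without adding parameters, which is why stacking several dilation rates lets a regression head see both fine and coarse context.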

Acknowledgements

This research project was partially supported by the National Natural Science Foundation of China (Grant Nos. 62072015, U19B2039, U1811463), and the National Key R&D Program of China (Grant No. 2018YFB1600903).

A portion of the work in this paper was carried out using the Taiji machine learning engine, and we thank Taiji for their support.

Author information

Authors and Affiliations

Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
Zhuangzhuang Miao, Yong Zhang, Haocheng Peng & Baocai Yin
Taiji Computer Corporation Ltd., Beijing, China
Yuan Peng

Corresponding author

Correspondence to Yong Zhang.

Ethics declarations

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Zhuangzhuang Miao is a master's student in the Faculty of Information Technology of Beijing University of Technology (BJUT). He received his B.S. degree from Shijiazhuang University in 2020. His research interests include deep learning and computer graphics.

Yong Zhang received his Ph.D. degree in computer science from BJUT in 2010. He is currently an associate professor of computer science at BJUT. His research interests include intelligent transportation systems, big data analysis, visualization, and computer graphics.

Yuan Peng received M.S. degrees in software engineering and in IT methods applied to business management from Jules Verne University of Picardy, France, in 2011 and 2012, respectively. He is currently a senior engineer at China Electronics Technology Group. His current research interests include geographic information systems, air traffic control, computer graphics, atmospheric operation modes, and radar echoes.

Haocheng Peng is currently studying for a bachelor's degree in IoT at Beijing Dublin International College. His current research interests include deep learning and blockchains.

Baocai Yin received his B.S., M.S., and Ph.D. degrees in computational mathematics from Dalian University of Technology, China, in 1985, 1988, and 1993, respectively. He is currently a professor in the Beijing Key Laboratory of Multimedia and Intelligent Software Technology, BJUT. His research interests include multimedia, image processing, computer vision, and pattern recognition.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.

About this article

Cite this article

Miao, Z., Zhang, Y., Peng, Y. et al. DTCC: Multi-level dilated convolution with transformer for weakly-supervised crowd counting. Comp. Visual Media 9, 859–873 (2023). https://doi.org/10.1007/s41095-022-0313-5

Received: 22 March 2022
Accepted: 12 September 2022
Published: 02 April 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s41095-022-0313-5
