HTNet: A Hybrid Model Boosted by Triple Self-attention for Crowd Counting

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14436)

Abstract

The swift development of convolutional neural networks (CNNs) has enabled significant progress in crowd counting research. However, the fixed-size convolutional kernels of traditional methods struggle with drastic scale changes and complex background interference. To tackle these challenges, we propose a hybrid crowd counting model. First, we place a global self-attention module (GAM) after the CNN backbone to capture wider contextual information. Second, because the feature map is gradually restored to full size during decoding, we employ a local self-attention module (LAM) in that stage to reduce computational complexity. With this design, the model fuses features from global and local perspectives to cope better with scale change. Additionally, to establish interdependence between the spatial and channel dimensions, we design a novel channel self-attention module (CAM) and combine it with LAM. Finally, we construct a simple yet effective double-head module that outputs a foreground segmentation map alongside an intermediate density map; the two are multiplied pixel-wise to suppress background interference. Experimental results on several benchmark datasets demonstrate that our method achieves remarkable improvements.
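
To make the mechanisms above concrete, below is a minimal PyTorch sketch of a global self-attention block, a channel self-attention block, and a double-head output whose density and foreground maps are multiplied pixel-wise. All class names, layer choices, and shapes are illustrative assumptions rather than the authors' implementation; the local window attention (LAM), the decoder, and the training losses are omitted.

```python
import torch
import torch.nn as nn

class GlobalSelfAttention(nn.Module):
    # Hypothetical stand-in for GAM: standard multi-head self-attention
    # over all H*W spatial positions of the backbone feature map.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        out, _ = self.attn(seq, seq, seq)
        return out.transpose(1, 2).view(b, c, h, w)

class ChannelSelfAttention(nn.Module):
    # Hypothetical stand-in for CAM: attention computed across the
    # channel dimension, so each channel acts as one token.
    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2)                    # (B, C, H*W)
        scale = (h * w) ** -0.5
        attn = torch.softmax(tokens @ tokens.transpose(1, 2) * scale, dim=-1)
        return (attn @ tokens).view(b, c, h, w)  # (B, C, C) @ (B, C, H*W)

class DoubleHead(nn.Module):
    # Two 1x1-conv heads: an intermediate density map and a foreground
    # probability map; their pixel-wise product suppresses background.
    def __init__(self, dim):
        super().__init__()
        self.density = nn.Conv2d(dim, 1, kernel_size=1)
        self.segment = nn.Sequential(nn.Conv2d(dim, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, H, W)
        return self.density(x) * self.segment(x)  # (B, 1, H, W)

# Toy usage on random "backbone" features.
feat = torch.randn(2, 256, 32, 32)
feat = GlobalSelfAttention(256)(feat)
feat = ChannelSelfAttention()(feat)
print(DoubleHead(256)(feat).shape)               # torch.Size([2, 1, 32, 32])
```

Because the segmentation head is sigmoid-bounded in [0, 1], pixels it scores near zero contribute almost nothing to the final density map, which is the background-suppression effect described in the abstract.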

This work was supported in part by the National Natural Science Foundation of China under Grant 62133013 and in part by the Chinese Association for Artificial Intelligence (CAAI)-Huawei MindSpore Open Fund.

Author information

Corresponding author

Correspondence to Baoqun Yin.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Li, Y., Yin, B. (2024). HTNet: A Hybrid Model Boosted by Triple Self-attention for Crowd Counting. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14436. Springer, Singapore. https://doi.org/10.1007/978-981-99-8555-5_23

  • DOI: https://doi.org/10.1007/978-981-99-8555-5_23

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8554-8

  • Online ISBN: 978-981-99-8555-5

  • eBook Packages: Computer Science, Computer Science (R0)
