Multi-scale Context Aggregation for Video-Based Person Re-Identification

  • Conference paper
Neural Information Processing (ICONIP 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1968)

Abstract

For video-based person re-identification (Re-ID), effectively aggregating video features is key to handling the many complicated conditions that arise in practice. Unlike previous methods that first extract spatial features and then aggregate temporal features, we propose a Multi-scale Context Aggregation (MSCA) method that learns spatial-temporal features from videos simultaneously. Specifically, we design an Attention-aided Feature Pyramid Network (AFPN), which recurrently aggregates the detailed and semantic information of multi-scale feature maps from the CNN backbone. To focus the aggregation on the more salient regions of the video, we embed a dedicated Spatial-Channel Attention (SCA) module into each layer of the pyramid. To further enrich the feature representations with temporal information while the spatial features are being extracted, we design a Temporal Enhancement Module (TEM) that can be inserted into each layer of the backbone network in a plug-and-play manner. Comprehensive experiments on three standard video-based person Re-ID benchmarks demonstrate that our method is competitive with most state-of-the-art methods.
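The abstract describes the Spatial-Channel Attention (SCA) module only at a high level. The sketch below gives one plausible CBAM-style reading of such a module in plain NumPy: channel attention via pooled-and-gated channel statistics, followed by spatial attention via pooled-and-gated location statistics. The function name, the pooling choices, and the omission of the learned MLP/convolution weights are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_channel_attention(feat):
    """CBAM-style attention sketch. feat: (C, H, W) feature map.

    A real SCA module would pass the pooled statistics through learned
    layers (an MLP for channels, a conv for the spatial map); here the
    gates are computed directly from the pooled values for clarity.
    """
    # Channel attention: squeeze spatial dims by avg- and max-pooling,
    # then gate each channel with a sigmoid.
    avg_pool = feat.mean(axis=(1, 2))            # (C,)
    max_pool = feat.max(axis=(1, 2))             # (C,)
    channel_gate = sigmoid(avg_pool + max_pool)  # (C,), values in (0, 1)
    feat = feat * channel_gate[:, None, None]
    # Spatial attention: squeeze the channel dim, gate each location.
    avg_map = feat.mean(axis=0)                  # (H, W)
    max_map = feat.max(axis=0)                   # (H, W)
    spatial_gate = sigmoid(avg_map + max_map)    # (H, W), values in (0, 1)
    return feat * spatial_gate[None, :, :]

# Toy usage: attention reweights but preserves the feature-map shape.
x = np.random.rand(8, 4, 4).astype(np.float32)
y = spatial_channel_attention(x)
assert y.shape == x.shape
```

Since both gates lie in (0, 1), the output is an elementwise down-weighting of the input, which is what lets such a module be embedded into each pyramid layer without changing tensor shapes.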


Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 62266009, 62276073, 61966004, 61962007), the Guangxi Natural Science Foundation (Nos. 2019GXNSFDA245018, 2018GXNSFDA281009, 2018GXNSFDA294001), the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, the Innovation Project of Guangxi Graduate Education (YCSW2023187), and the Guangxi "Bagui Scholar" Teams for Innovation and Research Project.

Author information

Corresponding author

Correspondence to Canlong Zhang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Wu, L., Zhang, C., Li, Z., Hu, L. (2024). Multi-scale Context Aggregation for Video-Based Person Re-Identification. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1968. Springer, Singapore. https://doi.org/10.1007/978-981-99-8181-6_8

  • DOI: https://doi.org/10.1007/978-981-99-8181-6_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8180-9

  • Online ISBN: 978-981-99-8181-6

  • eBook Packages: Computer Science (R0)
