Abstract
For video-based person re-identification (Re-ID), effectively aggregating video features is key to handling a variety of complicated situations. Unlike previous methods, which first extract spatial features and then aggregate temporal features, we propose a Multi-scale Context Aggregation (MSCA) method in this paper that learns spatial-temporal features from videos simultaneously. Specifically, we design an Attention-aided Feature Pyramid Network (AFPN), which recurrently aggregates the detail and semantic information of multi-scale feature maps from the CNN backbone. To focus the aggregation on the more salient regions of the video, we embed a dedicated Spatial-Channel Attention (SCA) module into each layer of the pyramid. To further enrich the feature representations with temporal information while the spatial features are being extracted, we design a Temporal Enhancement Module (TEM) that can be inserted into each layer of the backbone network in a plug-and-play manner. Comprehensive experiments on three standard video-based person Re-ID benchmarks demonstrate that our method is competitive with state-of-the-art methods.
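The abstract does not give the internals of the SCA module, so the following is only an illustrative sketch of the general spatial-channel attention idea it names (in the spirit of CBAM-style gating): a channel gate computed from spatially pooled statistics, followed by a spatial gate computed from channel-pooled statistics. The function name, pooling choices, and use of a plain sigmoid are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_channel_attention(feat):
    """Hypothetical spatial-channel attention gate over one feature map.

    feat: array of shape (C, H, W), one pyramid level's features.
    Returns a re-weighted feature map of the same shape.
    """
    # Channel attention: squeeze the spatial dims, gate each channel.
    channel_desc = feat.mean(axis=(1, 2))        # (C,)
    channel_gate = sigmoid(channel_desc)         # values in (0, 1)
    feat = feat * channel_gate[:, None, None]
    # Spatial attention: squeeze the channel dim, gate each location.
    spatial_desc = feat.mean(axis=0)             # (H, W)
    spatial_gate = sigmoid(spatial_desc)
    return feat * spatial_gate[None, :, :]

# Toy usage: an 8-channel 4x4 map keeps its shape after gating.
out = spatial_channel_attention(np.random.randn(8, 4, 4))
```

In the paper's pipeline, such a gate would sit inside each AFPN layer so that the multi-scale aggregation emphasizes salient person regions rather than background.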
Acknowledgements
This work is supported by National Natural Science Foundation of China (Nos. 62266009, 62276073, 61966004, 61962007), Guangxi Natural Science Foundation (Nos. 2019GXNSFDA245018, 2018GXNSFDA281009, 2018GXNSFDA294001), Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, Innovation Project of Guangxi Graduate Education (YCSW2023187), and Guangxi “Bagui Scholar” Teams for Innovation and Research Project.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wu, L., Zhang, C., Li, Z., Hu, L. (2024). Multi-scale Context Aggregation for Video-Based Person Re-Identification. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1968. Springer, Singapore. https://doi.org/10.1007/978-981-99-8181-6_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8180-9
Online ISBN: 978-981-99-8181-6
eBook Packages: Computer Science (R0)