Abstract
Action recognition in videos has attracted growing research interest because of the explosion of surveillance data in social security applications. Occlusions distract and mislead the network, so human action features usually suffer varying degrees of performance degradation. Observing occlusion scenes in the wild, we find that occluding objects usually move unpredictably but continuously. We therefore propose random walk erasing with attention calibration (RWEAC) for action recognition. Specifically, we introduce the random walk erasing (RWE) module to simulate unknown real-world occlusions across the frame sequence, expanding the diversity of data samples. When a region is erased (or occluded), the attention area becomes sparse, so we leverage the attention calibration (AC) module to keep attention stable on the remaining regions of interest. In short, our RWEAC network strengthens the learning of comprehensive features in complex environments and makes the feature representation robust. Experiments are conducted on the challenging UCF101 and HMDB51 video action recognition datasets, and extensive comparisons and ablation studies demonstrate the effectiveness of the proposed method.
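The core idea of the RWE module, an erased patch that moves unpredictably but continuously across consecutive frames like a real occluder, can be illustrated with a minimal sketch. This is not the authors' implementation; the patch size, step size, zero fill value, and plain-list frame representation are all assumptions made for illustration.

```python
import random

def random_walk_erase(frames, patch=16, step=4, seed=0):
    """Erase a square patch that drifts with a random walk across frames.

    `frames` is a list of H x W grids (lists of lists of floats). The
    occluder starts at a random position and takes a bounded random step
    between frames, so the erased region is unpredictable yet continuous.
    """
    rng = random.Random(seed)
    h, w = len(frames[0]), len(frames[0][0])
    # Start the simulated occluder at a random position inside the frame.
    y = rng.randint(0, h - patch)
    x = rng.randint(0, w - patch)
    out = []
    for frame in frames:
        erased = [row[:] for row in frame]  # copy; leave the input intact
        for dy in range(patch):
            for dx in range(patch):
                erased[y + dy][x + dx] = 0.0
        out.append(erased)
        # Random-walk update, clamped so the patch stays inside the frame.
        y = min(max(y + rng.randint(-step, step), 0), h - patch)
        x = min(max(x + rng.randint(-step, step), 0), w - patch)
    return out
```

Because the walk is clamped and bounded by `step`, the erased region never jumps across the frame, matching the paper's observation that real occluders move continuously.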
Acknowledgements
This work was supported in part by the Fundamental Research Funds for the Central Universities of China under Grant 191010001 and in part by the Hubei Key Laboratory of Transportation Internet of Things under Grant 2020III026GX.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Tian, Y., Zhong, X., Liu, W., Jia, X., Zhao, S., Ye, M. (2021). Random Walk Erasing with Attention Calibration for Action Recognition. In: Pham, D.N., Theeramunkong, T., Governatori, G., Liu, F. (eds) PRICAI 2021: Trends in Artificial Intelligence. PRICAI 2021. Lecture Notes in Computer Science(), vol 13033. Springer, Cham. https://doi.org/10.1007/978-3-030-89370-5_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89369-9
Online ISBN: 978-3-030-89370-5
eBook Packages: Computer Science, Computer Science (R0)