
Weakly-Supervised Action Localization, and Action Recognition Using Global–Local Attention of 3D CNN

International Journal of Computer Vision

Abstract

3D convolutional neural networks (3D CNNs) capture spatial and temporal information from 3D data such as video sequences. However, their convolution and pooling mechanisms make some information loss unavoidable. To improve the visual explanations and classification performance of a 3D CNN, we propose two approaches: (i) aggregating layer-wise global-to-local (global–local) discrete gradients from a trained 3DResNext network, and (ii) implementing an attention gating network to improve action-recognition accuracy. The proposed approach demonstrates the usefulness of every layer, termed global–local attention, in a 3D CNN via visual attribution, weakly-supervised action localization, and action recognition. First, the 3DResNext is trained and applied to action classification, with backpropagation taken with respect to the maximum predicted class. The gradient and activation of every layer are then up-sampled, and aggregation is used to produce more nuanced attention that highlights the parts of the input videos most critical to the predicted class. We use contour thresholding of the final attention map for localization. We evaluate spatial and temporal action localization in trimmed videos using fine-grained visual explanation via 3DCAM. Experimental results show that the proposed approach produces informative visual explanations and discriminative attention. Furthermore, action recognition via attention gating of each layer yields better classification results than the baseline model.
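The layer-wise aggregation described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names are invented, 2D maps stand in for the spatiotemporal volumes, nearest-neighbour up-sampling stands in for the paper's up-sampling, and the elementwise-product aggregation is one plausible reading of "aggregation" in the abstract.

```python
import numpy as np

def layer_attention(activation, gradient):
    """Per-layer attention: ReLU of the gradient-weighted activation,
    averaged over channels and min-max normalised to [0, 1]."""
    amap = np.maximum(gradient * activation, 0).mean(axis=0)  # (H, W)
    rng = amap.max() - amap.min()
    return (amap - amap.min()) / rng if rng > 0 else np.zeros_like(amap)

def upsample(amap, size):
    """Nearest-neighbour up-sampling to (size, size); assumes size is a
    multiple of the map resolution."""
    f = size // amap.shape[0]
    return np.repeat(np.repeat(amap, f, axis=0), f, axis=1)

def global_local_attention(layers, size=32, thresh=0.5):
    """Aggregate layer-wise attention maps (here: elementwise product)
    and threshold the result into a binary localization mask."""
    agg = np.ones((size, size))
    for act, grad in layers:
        agg *= upsample(layer_attention(act, grad), size)
    mask = agg >= thresh * agg.max() if agg.max() > 0 else agg > 0
    return agg, mask

# Toy example: activations/gradients from two layers at 8x8 and 16x16.
rng = np.random.default_rng(0)
layers = [(rng.random((4, 8, 8)), rng.random((4, 8, 8))),
          (rng.random((4, 16, 16)), rng.random((4, 16, 16)))]
att, mask = global_local_attention(layers, size=32)
print(att.shape, mask.dtype)  # (32, 32) bool
```

Multiplying normalised per-layer maps keeps only regions that every layer agrees on (the "global–local" consensus); averaging instead would give a softer aggregate.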



Acknowledgements

The authors would like to thank KAKENHI Project No. 16K00239 for funding the research.


Corresponding author

Correspondence to Novanto Yudistira.


Communicated by Koichi Kise.



Cite this article

Yudistira, N., Kavitha, M.S. & Kurita, T. Weakly-Supervised Action Localization, and Action Recognition Using Global–Local Attention of 3D CNN. Int J Comput Vis 130, 2349–2363 (2022). https://doi.org/10.1007/s11263-022-01649-x
