
Improved SSD using deep multi-scale attention spatial–temporal features for action recognition

  • Special Issue Paper
  • Published in: Multimedia Systems

Abstract

The key difference between video-based and image-based action recognition is that the former must additionally model the temporal dimension. Most deep-learning approaches to action recognition either (1) use 3D convolution to model temporal features, or (2) introduce an auxiliary temporal cue such as optical flow. However, 3D convolutional networks usually consume large amounts of computation, while optical flow requires a tedious extraction step and extra storage, and typically captures only short-range temporal structure. To model temporal features better, in this paper we propose a multi-scale attention spatial–temporal features network based on SSD. The whole video sequence is divided into segments and sparsely sampled over its full length, and a self-attention mechanism captures the relation between each frame and the sequence of frames sampled across the entire video, so that the network attends to the most representative frames in the sequence. Moreover, an attention mechanism assigns different weights to the inter-frame relations at different time scales, so as to reason about the contextual relations of actions along the temporal dimension. Our proposed method achieves competitive performance on two commonly used datasets: UCF101 and HMDB51.
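To make the two ideas in the abstract concrete, the sketch below is a minimal PyTorch illustration, not the authors' implementation: `sparse_sample`, `FrameSelfAttention`, the segment count of 8, and the 512-dimensional features are all assumed names and sizes. It shows segment-wise sparse sampling of frame indices over the full video, followed by a self-attention layer that lets each sampled frame attend to the whole sampled sequence; the multi-scale weighting of inter-frame relations is omitted for brevity.

```python
# Hypothetical sketch of sparse segment sampling + frame-level self-attention.
import torch
import torch.nn as nn

def sparse_sample(num_frames: int, num_segments: int = 8) -> torch.Tensor:
    """Split the video into equal segments and pick one random frame index per segment."""
    seg_len = num_frames / num_segments
    offsets = torch.rand(num_segments) * seg_len            # random offset inside each segment
    starts = torch.arange(num_segments) * seg_len
    return (starts + offsets).long().clamp(max=num_frames - 1)

class FrameSelfAttention(nn.Module):
    """Self-attention over per-frame features so each frame sees the whole sampled sequence."""
    def __init__(self, dim: int = 512, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_segments, dim), e.g. backbone features of the sampled frames
        attended, _ = self.attn(frame_feats, frame_feats, frame_feats)
        return self.norm(frame_feats + attended)             # residual connection + layer norm

# Usage: sample 8 frames from a 300-frame clip, then relate their features with attention.
idx = sparse_sample(num_frames=300, num_segments=8)
feats = torch.randn(2, 8, 512)                               # placeholder per-frame features
out = FrameSelfAttention()(feats)                            # (2, 8, 512)
```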



Acknowledgements

This work was supported by the Scientific Research Fund of Hunan Provincial Education Department of China (Project No. 17A007) and the Degree and Post-graduate Education Reform Project of Hunan Province of China (Project No. 2020JGZD043).

Author information

Corresponding author

Correspondence to Shuren Zhou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhou, S., Qiu, J. & Solanki, A. Improved SSD using deep multi-scale attention spatial–temporal features for action recognition. Multimedia Systems 28, 2123–2131 (2022). https://doi.org/10.1007/s00530-021-00831-4
