A motion-aware ConvLSTM network for action recognition

Applied Intelligence

Abstract

Human action recognition is an emerging task in computer vision, with applications such as video surveillance and human-computer interaction. Despite many attempts to develop deep architectures that learn the spatio-temporal features of video, hand-crafted optical flow remains an important part of the recognition process. To integrate motion features directly into the learning process, we propose a spatio-temporal video recognition network in which a motion-aware long short-term memory module estimates the motion flow while extracting spatio-temporal features. The module incorporates an optical flow estimator based on kernelized cross-correlation. The proposed network can be used without any extra learning process, and there is no need to pre-compute and store the optical flow. Extensive experiments on two action recognition benchmarks verify the effectiveness of the proposed approach.
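
As a rough illustration of the correlation idea behind the flow estimator mentioned in the abstract, the following NumPy sketch recovers a global integer translation between two frames from the peak of their circular cross-correlation. It is not the authors' estimator: it uses plain (unkernelized) correlation, assumes a single rigid shift with wrap-around rather than a dense flow field, and the function name `estimate_shift` is hypothetical.

```python
import numpy as np


def estimate_shift(prev_frame, next_frame):
    """Return integer (dy, dx) such that
    next_frame is approximately np.roll(prev_frame, (dy, dx), axis=(0, 1)).

    Assumption: both frames are 2-D arrays of the same shape and differ
    by one global circular translation (an illustrative simplification).
    """
    f_prev = np.fft.fft2(prev_frame)
    f_next = np.fft.fft2(next_frame)

    # Correlation theorem: the inverse FFT of F_next * conj(F_prev) is the
    # circular cross-correlation of the two frames over all shifts.
    corr = np.fft.ifft2(f_next * np.conj(f_prev)).real

    # The strongest response marks the most likely displacement.
    peak = np.unravel_index(np.argmax(corr), corr.shape)

    # Peaks past the midpoint of an axis correspond to negative shifts.
    dy, dx = (p - n if p > n // 2 else p for p, n in zip(peak, corr.shape))
    return int(dy), int(dx)
```

Calling `estimate_shift(frame, np.roll(frame, (3, -5), axis=(0, 1)))` on a textured frame should recover the applied displacement. Computing the correlation in the Fourier domain keeps the cost at a few FFTs per frame pair, which is one reason correlation-based estimators can be evaluated on the fly instead of pre-computing and storing flow.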

Author information

Corresponding author

Correspondence to Reza Safabakhsh.

About this article

Cite this article

Majd, M., Safabakhsh, R. A motion-aware ConvLSTM network for action recognition. Appl Intell 49, 2515–2521 (2019). https://doi.org/10.1007/s10489-018-1395-8

  • DOI: https://doi.org/10.1007/s10489-018-1395-8
