Applied Intelligence, Volume 49, Issue 7, pp 2515–2521

A motion-aware ConvLSTM network for action recognition

  • Mahshid Majd
  • Reza Safabakhsh


Human action recognition is an emerging goal of computer vision, with applications such as video surveillance and human-computer interaction. Despite many attempts to develop deep architectures that learn the spatio-temporal features of video, hand-crafted optical flow remains an important part of the recognition process. To integrate motion features into the learning process itself, we propose a spatio-temporal video recognition network in which a motion-aware long short-term memory module estimates the motion flow while extracting spatio-temporal features. The module incorporates an optical flow estimator based on kernelized cross-correlation. The proposed network can be used without any extra learning process, and there is no need to pre-compute and store the optical flow. Extensive experiments on two action recognition benchmarks verify the effectiveness of the proposed approach.
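To make the idea of correlation-based motion estimation concrete, the following is a minimal sketch that recovers a single global integer translation between two frames from the peak of their FFT-based cross-correlation. It is an illustration only: it uses plain linear cross-correlation rather than the kernelized cross-correlator the abstract refers to, it is not the authors' estimator, and the function name `estimate_shift` and the NumPy implementation are assumptions introduced here.

```python
import numpy as np

def estimate_shift(prev_frame, next_frame):
    """Estimate the integer (dy, dx) circular shift taking prev_frame to next_frame.

    Illustrative sketch only: plain cross-correlation, not a kernelized one.
    """
    # Cross-correlation theorem: corr = IFFT(FFT(next) * conj(FFT(prev))).
    f_prev = np.fft.fft2(prev_frame)
    f_next = np.fft.fft2(next_frame)
    corr = np.fft.ifft2(f_next * np.conj(f_prev)).real

    # The correlation peak sits at the displacement, modulo the frame size.
    peak = np.unravel_index(np.argmax(corr), corr.shape)

    # Map indices past the midpoint to negative shifts (circular wrap-around).
    return tuple(int(p if p <= s // 2 else p - s)
                 for p, s in zip(peak, corr.shape))
```

Calling `estimate_shift(frame, np.roll(frame, (3, -5), axis=(0, 1)))` on a textured frame recovers the applied displacement. A dense flow field would require applying such an estimate per local window, which this sketch does not attempt.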


Keywords: Human action recognition, Deep learning, Convolutional networks, LSTM, ConvLSTM




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Amirkabir University of Technology, Tehran, Iran
