Abstract
Human action recognition is an emerging goal of computer vision, with applications such as video surveillance and human-computer interaction. Despite many attempts to develop deep architectures that learn the spatio-temporal features of video, hand-crafted optical flow remains an important part of the recognition process. To integrate motion features directly into the learning process, we propose a spatio-temporal video recognition network in which a motion-aware long short-term memory module estimates the motion flow while extracting spatio-temporal features. The module incorporates an optical flow estimator based on kernelized cross-correlation. The proposed network requires no extra learning process, and the optical flow does not need to be pre-computed and stored. Extensive experiments on two action recognition benchmarks verify the effectiveness of the proposed approach.
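As a rough illustration of correlation-based motion estimation, the sketch below recovers a single global translation between two toy frames by locating the peak of their circular cross-correlation. It is not the estimator used in the paper: the abstract gives no implementation details, so this shows only the plain (linear, non-kernelized) special case, computed by brute force rather than in the Fourier domain. All function names, the 8x8 frame size, and the blob layout are hypothetical choices made for the example.

```python
def circular_cross_correlation(prev, curr):
    """corr[dy][dx] = sum of prev[y][x] * curr[(y+dy) % H][(x+dx) % W] over all pixels."""
    h, w = len(prev), len(prev[0])
    corr = [[0.0] * w for _ in range(h)]
    for dy in range(h):
        for dx in range(w):
            total = 0.0
            for y in range(h):
                for x in range(w):
                    total += prev[y][x] * curr[(y + dy) % h][(x + dx) % w]
            corr[dy][dx] = total
    return corr


def estimate_shift(prev, curr):
    """Estimate the dominant (dy, dx) translation that maps prev onto curr."""
    h, w = len(prev), len(prev[0])
    corr = circular_cross_correlation(prev, curr)
    # The correlation peak marks the shift that best aligns the two frames.
    best_dy, best_dx = max(
        ((dy, dx) for dy in range(h) for dx in range(w)),
        key=lambda s: corr[s[0]][s[1]],
    )
    # Map wrapped indices to signed displacements, e.g. index h-1 means -1.
    if best_dy > h // 2:
        best_dy -= h
    if best_dx > w // 2:
        best_dx -= w
    return best_dy, best_dx


def shift_frame(frame, dy, dx):
    """Circularly shift a frame by (dy, dx); a toy stand-in for real motion."""
    h, w = len(frame), len(frame[0])
    return [[frame[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


# Toy 8x8 frame containing a single bright 2x2 blob (hypothetical data).
frame_a = [[0.0] * 8 for _ in range(8)]
for y in (2, 3):
    for x in (1, 2):
        frame_a[y][x] = 1.0

frame_b = shift_frame(frame_a, 1, 3)   # move the blob 1 px down and 3 px right
motion = estimate_shift(frame_a, frame_b)
```

For this toy input, `motion` is `(1, 3)`. A dense optical flow estimator would apply the same idea to local patches instead of the whole frame, and a kernelized variant would replace the pixel-wise product with a kernel function; both extensions are beyond this sketch.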
Cite this article
Majd, M., Safabakhsh, R. A motion-aware ConvLSTM network for action recognition. Appl Intell 49, 2515–2521 (2019). https://doi.org/10.1007/s10489-018-1395-8
DOI: https://doi.org/10.1007/s10489-018-1395-8