Deep Learning for Action and Gesture Recognition in Image Sequences: A Survey

  • Maryam Asadi-Aghbolaghi
  • Albert Clapés
  • Marco Bellantonio
  • Hugo Jair Escalante
  • Víctor Ponce-López
  • Xavier Baró
  • Isabelle Guyon
  • Shohreh Kasaei
  • Sergio Escalera
Chapter
Part of the Springer Series on Challenges in Machine Learning book series (SSCML)

Abstract

Interest in automatic action and gesture recognition has grown considerably in the last few years, due in part to the large number of application domains for this type of technology. As in many other areas of computer vision, deep learning-based methods have quickly become the reference methodology for obtaining state-of-the-art performance in both tasks. This chapter surveys current deep learning-based methodologies for action and gesture recognition in sequences of images. The survey reviews both fundamental and cutting-edge methodologies reported in the last few years. We introduce a taxonomy that summarizes the important aspects of deep learning for approaching both tasks. Details of the proposed architectures, fusion strategies, main datasets, and competitions are reviewed. We also summarize and discuss the main works proposed so far, with particular interest in how they treat the temporal dimension of the data, their distinguishing features, and the opportunities and challenges for future research. To the best of our knowledge, this is the first survey on the topic. We foresee that this survey will become a reference in this rapidly evolving field of research.
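To illustrate one of the distinctions the survey draws, the sketch below (our own illustration, not a method taken from any of the surveyed works) contrasts two common ways deep networks treat the temporal dimension of an image sequence: applying a 2D convolution to each frame and pooling the responses over time, versus applying a spatio-temporal 3D convolution whose kernel spans neighbouring frames. Tensor sizes and layer parameters are arbitrary placeholders, and the example assumes PyTorch is available.

```python
# Minimal sketch: two ways of handling the temporal dimension of a clip.
import torch
import torch.nn as nn

# A toy input clip: (batch, channels, time, height, width).
frames = torch.randn(1, 3, 16, 112, 112)

# (a) Frame-wise 2D convolution + temporal average pooling:
#     each frame is processed independently, then responses are pooled over time.
conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
per_frame = torch.stack(
    [conv2d(frames[:, :, t]) for t in range(frames.shape[2])], dim=2
)
pooled_2d = per_frame.mean(dim=2)  # temporal ordering collapsed by pooling

# (b) Spatio-temporal 3D convolution: the kernel also extends over time,
#     so short-term motion patterns are learned directly by the filter.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3), padding=1)
spatio_temporal = conv3d(frames)

print(pooled_2d.shape)         # torch.Size([1, 64, 112, 112])
print(spatio_temporal.shape)   # torch.Size([1, 64, 16, 112, 112])
```

Collapsing time by pooling discards the ordering of motion, whereas the 3D kernel preserves a temporal axis in its output; both design choices, together with recurrent alternatives, recur throughout the architectures reviewed in this chapter.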

Keywords

Action recognition · Gesture recognition · Deep learning architectures · Fusion strategies

Acknowledgements

This work has been partially supported by the Spanish projects TIN2015-66951-C2-2-R and TIN2016-74946-P (MINECO/FEDER, UE) and CERCA Programme / Generalitat de Catalunya. Hugo Jair Escalante was supported by CONACyT under grants CB2014-241306 and PN-215546.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Maryam Asadi-Aghbolaghi (1, 2, 3)
  • Albert Clapés (2, 3)
  • Marco Bellantonio (4)
  • Hugo Jair Escalante (5)
  • Víctor Ponce-López (6)
  • Xavier Baró (7)
  • Isabelle Guyon (8, 9)
  • Shohreh Kasaei (1)
  • Sergio Escalera (2, 3)
  1. Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
  2. Computer Vision Center, Autonomous University of Barcelona, Barcelona, Spain
  3. Department of Mathematics and Informatics, University of Barcelona, Barcelona, Spain
  4. Facultat d'Informatica, Polytechnic University of Barcelona, Barcelona, Spain
  5. Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
  6. Eurecat, Barcelona, Catalonia, Spain
  7. EIMT, Open University of Catalonia, Barcelona, Spain
  8. UPSud and INRIA, Université Paris-Saclay, Paris, France
  9. ChaLearn, Berkeley, USA