Multimedia Tools and Applications, Volume 78, Issue 19, pp 27309–27331

LSTM-based real-time action detection and prediction in human motion streams

  • Fabio Carrara
  • Petr Elias
  • Jan Sedmidubsky
  • Pavel Zezula


Motion capture data digitally represent human movements as sequences of 3D skeleton configurations. Such spatio-temporal data, often recorded as continuous streams, need to be processed efficiently to detect high-interest actions, for example, to understand hand gestures in real time in human-computer interaction. Alternatively, automatically annotated parts of a continuous stream can be persistently stored so that they become searchable and thus reusable for future retrieval or pattern mining. In this paper, we focus on multi-label detection of user-specified actions in unsegmented sequences as well as continuous streams. In particular, we build on recent advances in recurrent neural networks and adopt a unidirectional LSTM model to effectively encode the skeleton frames within the hidden network states. During the training phase, the model learns which subsequences of encoded frames belong to the specified action classes. The learned class representations are then employed in the annotation phase to infer the probability that an incoming skeleton frame belongs to a given action class. Finally, the computed probabilities are compared against a learned threshold to automatically determine the beginnings and endings of actions. To further enhance annotation accuracy, we utilize a bidirectional LSTM model that estimates class probabilities by considering not only past frames but also future ones. We extensively evaluate both models on three use cases: real-time stream annotation, offline annotation of long sequences, and early action detection and prediction. The experiments demonstrate that our models outperform the state of the art in effectiveness and are at least one order of magnitude more efficient, being able to annotate 10,000 frames per second.
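The final thresholding step, which turns per-frame class probabilities into action beginnings and endings, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `detect_segments`, the fixed threshold values, and the toy probabilities are hypothetical. In the paper the thresholds are learned and the probabilities are produced by the LSTM model.

```python
from typing import List, Sequence, Tuple


def detect_segments(
    probs: Sequence[Sequence[float]],
    thresholds: Sequence[float],
) -> List[Tuple[int, int, int]]:
    """Turn per-frame class probabilities into (class, start, end) segments.

    probs[t][c] is the (hypothetical) model output for frame t and class c.
    A segment of class c covers a maximal run of consecutive frames with
    probs[t][c] >= thresholds[c]; `end` is an inclusive frame index.
    """
    segments = []
    for c in range(len(thresholds)):
        start = None  # first frame of the currently open segment, if any
        for t, frame in enumerate(probs):
            active = frame[c] >= thresholds[c]
            if active and start is None:
                start = t  # action beginning
            elif not active and start is not None:
                segments.append((c, start, t - 1))  # action ending
                start = None
        if start is not None:  # sequence ended while the action was active
            segments.append((c, start, len(probs) - 1))
    return segments


# Toy input: 5 frames, 2 action classes (values are made up for illustration).
probs = [
    [0.1, 0.2],
    [0.7, 0.3],
    [0.8, 0.6],
    [0.4, 0.9],
    [0.2, 0.7],
]
segments = detect_segments(probs, thresholds=[0.5, 0.5])
print(segments)
```

Each class is thresholded independently, so segments of different classes may overlap in time, which matches the multi-label setting described above. In a streaming setting the inner loop would run once per incoming frame instead of over a stored sequence.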


Keywords: Motion capture data, Stream annotation, Action detection and recognition, Action prediction, LSTM



This research was supported by Smart News, “Social sensing for breaking news”, CUP CIPE D58C15000270008, by Automatic Data and documents Analysis to enhance human-based processes (ADA), CUP CIPE D55F17000290009, and by ERDF “CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence” (No. CZ.02.1.01/0.0/0.0/16_019/0000822). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. ISTI-CNR, Pisa, Italy
  2. Masaryk University, Brno, Czech Republic
