LSTM-based real-time action detection and prediction in human motion streams


Motion capture data digitally represent human movements by sequences of 3D skeleton configurations. Such spatio-temporal data, often recorded in the stream-based nature, need to be efficiently processed to detect high-interest actions, for example, in human-computer interaction to understand hand gestures in real time. Alternatively, automatically annotated parts of a continuous stream can be persistently stored to become searchable, and thus reusable for future retrieval or pattern mining. In this paper, we focus on multi-label detection of user-specified actions in unsegmented sequences as well as continuous streams. In particular, we utilize the current advances in recurrent neural networks and adopt a unidirectional LSTM model to effectively encode the skeleton frames within the hidden network states. The model learns what subsequences of encoded frames belong to the specified action classes within the training phase. The learned representations of classes are then employed within the annotation phase to infer the probability that an incoming skeleton frame belongs to a given action class. The computed probabilities are finally compared against a learned threshold to automatically determine the beginnings and endings of actions. To further enhance the annotation accuracy, we utilize a bidirectional LSTM model to estimate class probabilities by considering not only the past frames but also the future ones. We extensively evaluate both the models on the three use cases of real-time stream annotation, offline annotation of long sequences, and early action detection and prediction. The experiments demonstrate that our models outperform the state of the art in effectiveness and are at least one order of magnitude more efficient, being able to annotate 10 k frames per second.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9


  1. 1.

    The code to reproduce the experiments is publicly available at


  1. 1.

    Aberman K, Wu R, Lischinski D, Chen B, Cohen-Or D (2019) Learning character-agnostic motion for motion retargeting in 2d. ACM Trans Graph 38(4). arXiv:1905.01680

  2. 2.

    Asadi-Aghbolaghi M, Clapés A, Bellantonio M, Escalante HJ, Ponce-López V, Baró X, Guyon I, Kasaei S, Escalera S (2017) A survey on deep learning based approaches for action and gesture recognition in image sequences. In: 2017 12th IEEE international conference on automatic face gesture recognition (FG 2017), pp 476–483

  3. 3.

    Baltrušaitis T, Ahuja C, Morency L (2019) Multimodal machine learning: a survey and taxonomy. IEEE Trans Pattern Anal Mach Intell 41(2):423–443

    Article  Google Scholar 

  4. 4.

    Barbič J, Safonova A, Pan JY, Faloutsos C, Hodgins JK, Pollard NS (2004) Segmenting motion capture data into distinct behaviors. In: Proceedings of graphics interface 2004. Canadian Human-Computer Communications Society, pp 185–194

  5. 5.

    Barnachon M, Bouakaz S, Boufama B, Guillou E (2014) Ongoing human action recognition with motion capture. Pattern Recogn 47(1):238–247

    Article  Google Scholar 

  6. 6.

    Boulahia SY, Anquetil E, Multon F, Kulpa R (2018) Cudi3d: curvilinear displacement based approach for online 3d action detection. In: Computer vision and image understanding

  7. 7.

    Butepage J, Black MJ, Kragic D, Kjellstrom H (2017) Deep representation learning for human motion prediction and classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6158–6166

  8. 8.

    Cao Z, Simon T, Wei S, Sheikh Y (2017) Realtime multi-person 2d pose estimation using part affinity fields. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 1302–1310

  9. 9.

    Chen C, Jafari R, Kehtarnavaz N (2017) A survey of depth and inertial sensor fusion for human action recognition. Multimed Tools Appl 76(3):4405–4425

    Article  Google Scholar 

  10. 10.

    Du Y, Wang W, Wang L (2015) Hierarchical recurrent neural network for skeleton based action recognition. In: 2015 IEEE conference on computer vision and pattern recognition, pp 1110–1118

  11. 11.

    Elias P, Sedmidubsky J, Zezula P (2017) A real-time annotation of motion data streams. In: 19th International symposium on multimedia. IEEE Computer Society, pp 154–161

  12. 12.

    Evangelidis G, Singh G, Horaud R (2014) Skeletal quads: human action recognition using joint quadruples. In: 22nd International conference on pattern recognition (ICPR 2014), pp 4513–4518

  13. 13.

    Field M, Stirling D, Pan Z, Ros M, Naghdy F (2015) Recognizing human motions through mixture modeling of inertial data. Pattern Recognit 48(8):2394–2406

    Article  Google Scholar 

  14. 14.

    Fothergill S, Mentis H, Kohli P, Nowozin S (2012) Instructing people for training gestural interactive systems. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI ’12. ACM, New York, pp 1737–1746

  15. 15.

    Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

    Article  Google Scholar 

  16. 16.

    Hussein ME, Torki M, Gowayyed MA, El-Saban M (2013) Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations. In: Joint conference on artificial intelligence (IJCAI 2013), pp 2466–2472

  17. 17.

    Jain A, Zamir AR, Savarese S, Saxena A (2016) Structural-rnn: deep learning on spatio-temporal graphs. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 5308–5317

  18. 18.

    Kadu H, Kuo CCJ (2014) Automatic human mocap data classification. IEEE Trans Multimedia 16(8):2191–2202

    Article  Google Scholar 

  19. 19.

    Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412:6980

  20. 20.

    Kratz L, Smith M, Lee F (2007) Wiizards: 3d gesture recognition for game play input. In: Proceedings of the 2007 conference on future play. Future play ’07, pp 209–212

  21. 21.

    Krüger B, Vögele A, Willig T, Yao A, Klein R, Weber A (2017) Efficient unsupervised temporal segmentation of motion data. IEEE Trans Multimedia 19(4):797–812

    Article  Google Scholar 

  22. 22.

    Lakens D (2010) Movement synchrony and perceived entitativity. J Exp Soc Psychol 46(5):701–708

    Article  Google Scholar 

  23. 23.

    Laraba S, Brahimi M, Tilmanne J, Dutoit T (2017) 3d skeleton-based action recognition by representing motion capture sequences as 2d-rgb images. Comput Anim Virtual Worlds 28(3–4)

  24. 24.

    Li Y, Lan C, Xing J, Zeng W, Yuan C, Liu J (2016) Online human action detection using joint classification-regression recurrent neural networks. In: Leibe B, Matas J, Sebe N, Welling M (eds) Computer vision—ECCV 2016. Springer International Publishing, Cham, pp 203–220

  25. 25.

    Li K, He FZ, Yu HP, Chen X (2017) A correlative classifiers approach based on particle filter and sample set for tracking occluded target. Appl Math–A Journal of Chinese Universities 32(3):294–312

    MathSciNet  Article  Google Scholar 

  26. 26.

    Li K, He FZ, Yu HP (2018) Robust visual tracking based on convolutional features with illumination and occlusion handing. J Comput Sci Technol 33(1):223–236

    Article  Google Scholar 

  27. 27.

    Li S, Li K, Fu Y (2018) Early recognition of 3d human actions. ACM Trans Multimedia Comput Commun Appl 14(1s):20:1–20:21

    Article  Google Scholar 

  28. 28.

    Liu J, Wang G, Duan L, Hu P, Kot AC (2018) Skeleton based human action recognition with global context-aware attention LSTM networks. IEEE Trans Image Process 27(4):1586–1599

    MathSciNet  Article  MATH  Google Scholar 

  29. 29.

    Ma S, Sigal L, Sclaroff S (2016) Learning activity progression in lstms for activity detection and early detection. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 1942–1950

  30. 30.

    Müller M, Röder T, Clausen M, Eberhardt B, Krüger B, Weber A (2007) Documentation Mocap Database HDM05. Tech. Rep. CG-2007-2, Universität Bonn

  31. 31.

    Müller M, Baak A, Seidel HP (2009) Efficient and robust annotation of motion capture data. In: ACM SIGGRAPH/Eurographics symposium on computer animation (SCA 2009). ACM Press, pp 17–26

  32. 32.

    Nunez JC, Cabido R, Pantrigo JJ, Montemayor AS, Velez JF (2018) Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition. Pattern Recogn 76:80–94

    Article  Google Scholar 

  33. 33.

    Poppe R, Van Der Zee S, Heylen DKJ, Taylor PJ (2014) Amab: automated measurement and analysis of body motion. Behav Res Methods 46(3):625–633

    Google Scholar 

  34. 34.

    Raptis M, Kirovski D, Hoppe H (2011) Real-time classification of dance gestures from skeleton animation. In: ACM SIGGRAPH Eurographics symposium on computer animation (SCA 2011), SCA 2011. ACM, pp 147–156

  35. 35.

    Sedmidubsky J, Elias P, Zezula P (2018) Effective and efficient similarity searching in motion capture data. Multimed Tools Appl 77(10):12,073–12,094

    Article  Google Scholar 

  36. 36.

    Singh D, Merdivan E, Psychoula I, Kropf J, Hanke S, Geist M, Holzinger A (2017) Human activity recognition using recurrent neural networks. In: Holzinger A, Kieseberg P, Tjoa AM, Weippl E (eds) Machine learning and knowledge extraction. Springer International Publishing, Cham, pp 267–274

  37. 37.

    Song S, Lan C, Xing J, Zeng W, Liu J (2018) Spatio-temporal attention-based lstm networks for 3d action recognition and detection. IEEE Trans Image Process 27(7):3459–3471

    MathSciNet  Article  MATH  Google Scholar 

  38. 38.

    Vieira A, Lewiner T, Schwartz W, Campos M (2012) Distance matrices as invariant features for classifying mocap data. In: 21st International conference on pattern recognition (ICPR 2012), pp 2934–2937

  39. 39.

    Wang Y, Neff M (2015) Deep signatures for indexing and retrieval in large motion databases. In: 8th ACM SIGGRAPH conference on motion in games. ACM, pp 37–45

  40. 40.

    Wang C, Wang Y, Yuille AL (2013) An approach to pose-based action recognition. In: Proceedings of the 2013 IEEE conference on computer vision and pattern recognition, CVPR ’13. IEEE Computer Society, pp 915–922

  41. 41.

    Wu D, Shao L (2014) Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition. In: 2014 IEEE conference on computer vision and pattern recognition, pp 724–731

  42. 42.

    Xia L, Chen CC, Aggarwal JK (2012) View invariant human action recognition using histograms of 3D joints. In: CVPR workshops, pp 20–27

  43. 43.

    Xu Y, Shen Z, Zhang X, Gao Y, Deng S, Wang Y, Fan Y, Chang EC (2017) Learning multi-level features for sensor-based human action recognition. Pervasive Mob Comput 40:324–338

    Article  Google Scholar 

  44. 44.

    Yu X, Liu W, Xing W (2017) Behavioral segmentation for human motion capture data based on graph cut method. J Vis Lang Comput 43:50–59

    Article  Google Scholar 

  45. 45.

    Zanfir M, Leordeanu M, Sminchisescu C (2013) The moving pose: an efficient 3d kinematics descriptor for low-latency action recognition and detection. In: International conference on computer vision (ICCV 2013), pp 2752–2759

  46. 46.

    Zhao X, Li X, Pang C, Sheng QZ, Wang S, Ye M (2014) Structured streaming skeleton—a new feature for online human gesture recognition. ACM Trans Multimedia Comput Commun Appl 11(1s):22:1–22:18

    Article  Google Scholar 

  47. 47.

    Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, Xie X (2016) Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: 30th AAAI conference on artificial intelligence, AAAI 2016. AAAI Press, pp 3697–3703

Download references


This research was supported by Smart News, “Social sensing for breaking news”, CUP CIPE D58C15000270008, by Automatic Data and documents Analysis to enhance human-based processes (ADA), CUP CIPE D55F17000290009, and by ERDF “CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence” (No. CZ.02.1.01/0.0/0.0/16_019/0000822). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

Author information



Corresponding author

Correspondence to Fabio Carrara.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Carrara, F., Elias, P., Sedmidubsky, J. et al. LSTM-based real-time action detection and prediction in human motion streams. Multimed Tools Appl 78, 27309–27331 (2019).

Download citation


  • Motion capture data
  • Stream annotation
  • Action detection and recognition
  • Action prediction
  • LSTM