Tracking human-like natural motion by combining two deep recurrent neural networks with Kalman filter

The Kinect skeleton tracker provides convenient, low-cost human body tracking with reasonable accuracy. However, the tracker often captures unnatural human poses, such as discontinuous and jittery movements, when self-occlusions occur. In this study, we propose a post-processing method that improves the skeleton produced by a single Kinect sensor by combining probabilistic filtering with supervised learning to correct unnatural tracking movements. Specifically, two deep recurrent neural networks refine the joint velocities and joint positions produced by the Kinect skeleton tracker, and a classic Kalman filter further refines the positions and velocities. In addition, we propose a novel measure to evaluate the naturalness of captured joint trajectories. We evaluated the proposed approach by comparing it to ground truth obtained using a commercial optical marker-based motion capture system.
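The final refinement stage can be illustrated with a small sketch. The abstract does not specify the filter's state model or noise parameters, so everything below is an assumption chosen for illustration: a constant-velocity model applied independently to one joint coordinate, a 30 Hz frame interval, hypothetical noise values `q`, `r_pos`, and `r_vel`, and the hypothetical function name `kalman_refine`. The position and velocity inputs stand in for the outputs of the two recurrent networks, which are not modeled here. The code is written in plain Python to stay dependency-free.

```python
def mat_mul(a, b):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_add(a, b):
    """Add two 2x2 matrices element-wise."""
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def transpose(a):
    """Transpose a 2x2 matrix."""
    return [[a[j][i] for j in range(2)] for i in range(2)]


def inverse_2x2(m):
    """Invert a 2x2 matrix using the closed-form adjugate formula."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def kalman_refine(positions, velocities, dt=1.0 / 30.0,
                  q=1e-3, r_pos=1e-2, r_vel=1e-1):
    """Fuse per-frame position and velocity estimates for one coordinate.

    State is [position, velocity]; both are measured directly (H = I).
    All default parameter values are illustrative assumptions.
    """
    if len(positions) != len(velocities):
        raise ValueError("positions and velocities must have equal length")
    if not positions:
        return []

    F = [[1.0, dt], [0.0, 1.0]]        # constant-velocity transition
    Q = [[q, 0.0], [0.0, q]]           # process noise (assumed)
    R = [[r_pos, 0.0], [0.0, r_vel]]   # measurement noise (assumed)

    x = [positions[0], velocities[0]]  # initialize from the first frame
    P = [[1.0, 0.0], [0.0, 1.0]]
    refined = [(x[0], x[1])]

    for z_pos, z_vel in zip(positions[1:], velocities[1:]):
        # Predict: propagate state and covariance one frame ahead.
        x = [x[0] + dt * x[1], x[1]]
        P = mat_add(mat_mul(mat_mul(F, P), transpose(F)), Q)

        # Update: with H = I, the gain is K = P (P + R)^-1.
        K = mat_mul(P, inverse_2x2(mat_add(P, R)))
        y = [z_pos - x[0], z_vel - x[1]]  # innovation
        x = [x[0] + K[0][0] * y[0] + K[0][1] * y[1],
             x[1] + K[1][0] * y[0] + K[1][1] * y[1]]
        i_minus_k = [[1.0 - K[0][0], -K[0][1]],
                     [-K[1][0], 1.0 - K[1][1]]]
        P = mat_mul(i_minus_k, P)
        refined.append((x[0], x[1]))

    return refined
```

Under these assumptions, a trajectory that already follows the constant-velocity model passes through essentially unchanged, while an isolated position spike is attenuated because the prediction and the velocity measurement both disagree with it. A full skeleton would call the function once per joint coordinate.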



This work was supported by the Technology Innovation Industrial Program funded by the Ministry of Trade, (MI, South Korea) [10073161 & 10048320, Technology Innovation Program], as well as by the Institute for Information & Communications Technology Promotion (IITP) grant funded by MSIT (No. 2018-0-00622).

Author information

Correspondence to Il Hong Suh.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 7559 KB)

Supplementary material 2 (mp4 4990 KB)

Cite this article

Kim, J.B., Park, Y. & Suh, I.H. Tracking human-like natural motion by combining two deep recurrent neural networks with Kalman filter. Intel Serv Robotics 11, 313–322 (2018).


  • Human skeleton tracking
  • Deep recurrent neural network
  • Kalman filter