A framework for interpreting, modeling and recognizing human body gestures through 3D eigenpostures

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

In this article we propose a novel system for recognizing human gestures through the acquisition and processing of volumetric data sequences. The sequences are acquired with two different approaches: a multi-camera set-up and a multi-Kinect\(^\mathrm{TM}\) set-up. Recognition based on a volumetric representation does not require any skeleton fitting or limb tracking; instead, the system extracts robust features directly from the available 3D data. The volumetric shape descriptors are invariant with respect to viewpoint and body size, and they are designed to provide a unique signature for each posture. Hidden Markov Models (HMMs), trained on different gestures, are then used to identify a set of key postures and to classify their sequences over a set of possible actions. The paper also presents a method for identifying the number of hidden states of the HMMs that describe the gestures. Despite its conceptual simplicity and ease of implementation, the number of states estimated with this method turns out to match that given by classical approaches such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). The same approach is also applied to define the Gaussian mixture that models the observations of the hidden states, with good results. Extensive tests were performed on a database that we acquired, made of ten different actions, each performed by five different actors in five different ways (different speed and orientation), and on another public database, achieving a 96% correct recognition rate.
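The classical criteria that the abstract uses as a point of comparison can be sketched in a few lines. The following is only an illustration of AIC and BIC model selection, not the method proposed in the paper: the function names are hypothetical, the emission model (one diagonal-covariance Gaussian per state) is an assumption made to keep the parameter count simple, and the log-likelihood values are made-up placeholders standing in for the output of HMM training.

```python
import math


def hmm_free_params(n_states, dim):
    """Count free parameters of an HMM whose states each emit from a single
    diagonal-covariance Gaussian (an assumed, simplified emission model)."""
    transitions = n_states * (n_states - 1)  # each transition row sums to 1
    initial = n_states - 1                   # initial distribution sums to 1
    emissions = n_states * 2 * dim           # one mean and one variance per dimension
    return transitions + initial + emissions


def aic(log_lik, k):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return k * math.log(n_obs) - 2 * log_lik


def select_n_states(log_liks, dim, n_obs):
    """Return the state counts that minimize AIC and BIC, respectively.

    log_liks maps a candidate number of states to the maximized
    log-likelihood of a model trained with that many states.
    """
    def aic_score(n):
        return aic(log_liks[n], hmm_free_params(n, dim))

    def bic_score(n):
        return bic(log_liks[n], hmm_free_params(n, dim), n_obs)

    return min(log_liks, key=aic_score), min(log_liks, key=bic_score)


# Hypothetical log-likelihoods (placeholders, not results from the paper).
example_log_liks = {2: -1500.0, 3: -1380.0, 4: -1375.0, 5: -1372.0}
best_by_aic, best_by_bic = select_n_states(example_log_liks, dim=3, n_obs=500)
print(best_by_aic, best_by_bic)
```

Both criteria trade goodness of fit against model size: adding states always raises the log-likelihood, but the penalty term grows with the number of free parameters, so the minimum marks the point where extra states stop paying for themselves.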

Notes

  1. http://www.marcon.net → projects → volumetric gesture recognition.

  2. http://www.r-project.org.

  3. http://cran.r-project.org/web/packages/R.matlab.

Author information

Corresponding author

Correspondence to Marco Marcon.

About this article

Cite this article

Marcon, M., Paracchini, M.B.M. & Tubaro, S. A framework for interpreting, modeling and recognizing human body gestures through 3D eigenpostures. Int. J. Mach. Learn. & Cyber. 10, 1205–1226 (2019). https://doi.org/10.1007/s13042-018-0801-1

  • DOI: https://doi.org/10.1007/s13042-018-0801-1
