A framework for interpreting, modeling and recognizing human body gestures through 3D eigenpostures

  • Marco Marcon
  • Marco Brando Mario Paracchini
  • Stefano Tubaro
Original Article


In this article we propose a novel system for recognizing human gestures by acquiring and processing sequences of volumetric data. The sequences are acquired with two different approaches: a multi-camera set-up and a multi-Kinect™ set-up. Recognition based on a volumetric representation requires no skeleton fitting or limb tracking; instead, the system extracts robust features directly from the available 3D data. The volumetric shape descriptors are invariant to viewpoint and body size, and they are designed to provide a unique signature for each posture. Hidden Markov Models (HMMs), trained on different gestures, are then used to identify a set of key postures and to classify posture sequences over a set of possible actions. The paper also presents a method for determining the number of hidden states of the HMMs that describe the gestures. Despite its conceptual simplicity and ease of implementation, the method estimates a number of states that matches the one obtained with classical approaches such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). The same approach is also applied to define the Gaussian mixture that models the observations of the hidden states, with good results. Extensive tests were performed on a database that we acquired, consisting of ten different actions, each performed by five different actors in five different ways (different speeds and orientations), and on another public database, achieving a 96% correct recognition rate.
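To make the classification step concrete, the following minimal sketch scores a posture sequence against one HMM per gesture with the forward algorithm and picks the most likely gesture. It is an illustration under stated assumptions, not the authors' implementation: observations are simplified to discrete key-posture indices (the article models observations with Gaussian mixtures), and the two toy models, their names and their probabilities are hypothetical. The `aic` and `bic` helpers show only the classical criteria mentioned above, not the state-selection method proposed in the article.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def safe_log(p):
    """log(p), with log(0) mapped to -inf instead of raising."""
    return math.log(p) if p > 0 else -math.inf


def forward_loglik(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under one HMM.

    start[i]    : P(first state = i)
    trans[i][j] : P(next state = j | current state = i)
    emit[i][k]  : P(observed key posture = k | state = i)
    """
    n = len(start)
    alpha = [safe_log(start[i]) + safe_log(emit[i][obs[0]]) for i in range(n)]
    for o in obs[1:]:
        alpha = [
            logsumexp([alpha[i] + safe_log(trans[i][j]) for i in range(n)])
            + safe_log(emit[j][o])
            for j in range(n)
        ]
    return logsumexp(alpha)


def classify(obs, models):
    """Return the gesture whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda name: forward_loglik(obs, *models[name]))


def aic(loglik, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * loglik


# Hypothetical toy models: 2 hidden states, 3 key-posture symbols (0, 1, 2).
LEFT_TO_RIGHT = [[0.7, 0.3], [0.0, 1.0]]
MODELS = {
    # Starts near posture 0 and ends near posture 2.
    "raise_arm": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]),
    # Starts near posture 2 and ends near posture 0.
    "lower_arm": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]),
}

if __name__ == "__main__":
    sequence = [0, 0, 2, 2]
    print(classify(sequence, MODELS))
```

Working in log space keeps long sequences from underflowing. In a model-selection setting, one would fit candidate HMMs with different numbers of states and compare their `aic` or `bic` values, which is the classical baseline the article compares its own estimate against.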



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
