Multimedia Tools and Applications, Volume 77, Issue 16, pp 21617–21652

Action recognition in depth videos using hierarchical Gaussian descriptor

  • Xuan Son Nguyen
  • Abdel-Illah Mouaddib
  • Thanh Phuong Nguyen
  • Laurent Jeanpierre


In this paper, we propose a new approach based on distribution descriptors for action recognition in depth videos. Our local features are computed from binary patterns that incorporate shape and motion cues for effective action recognition. Given pixel-level features, our approach estimates video-local statistics in a hierarchical manner: the distribution of pixel-level features within each frame, and the distribution of the resulting frame-level descriptors, are each modeled by a single Gaussian. In this way, our approach constructs video descriptors directly from low-level features without resorting to the codebook learning required by bag-of-features (BoF) approaches. To capture the spatial geometry and temporal order of a video, we use a spatio-temporal pyramid representation. Our approach is validated on six benchmark datasets: MSRAction3D, MSRGesture3D, DHA, SKIG, UTD-MHAD and CAD-120. The experimental results show that our approach performs well on all the datasets; in particular, it achieves state-of-the-art accuracies on the DHA, SKIG and UTD-MHAD datasets.
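The hierarchical estimation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes pixel-level features are given as one array per frame, fits a Gaussian per frame, embeds each Gaussian into a symmetric positive definite (SPD) matrix via the standard embedding [[Σ + μμᵀ, μ], [μᵀ, 1]], flattens it with the matrix logarithm (log-Euclidean mapping), and repeats the same two steps over the frame-level descriptors. All function names here are illustrative.

```python
import numpy as np
from scipy.linalg import logm


def gaussian_embedding(features):
    """Fit a single Gaussian N(mu, Sigma) to the rows of `features`
    and embed it as the SPD matrix [[Sigma + mu mu^T, mu], [mu^T, 1]]."""
    mu = features.mean(axis=0)
    # Small ridge keeps the covariance positive definite.
    sigma = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    d = mu.shape[0]
    g = np.empty((d + 1, d + 1))
    g[:d, :d] = sigma + np.outer(mu, mu)
    g[:d, d] = mu
    g[d, :d] = mu
    g[d, d] = 1.0
    return g


def spd_to_vector(spd):
    """Map an SPD matrix to a Euclidean vector via the matrix logarithm
    (log-Euclidean mapping), keeping the upper-triangular entries."""
    log_spd = logm(spd).real  # symmetric, so the log is real
    iu = np.triu_indices_from(log_spd)
    return log_spd[iu]


def hierarchical_gaussian_descriptor(pixel_features_per_frame):
    """pixel_features_per_frame: list of (n_pixels_t, d) arrays, one per frame.
    Stage 1: one Gaussian over the pixel-level features of each frame.
    Stage 2: one Gaussian over the resulting frame-level descriptors."""
    frame_descs = np.stack([
        spd_to_vector(gaussian_embedding(f)) for f in pixel_features_per_frame
    ])
    return spd_to_vector(gaussian_embedding(frame_descs))
```

With d-dimensional pixel features, each frame descriptor has (d+1)(d+2)/2 entries, and the final video descriptor applies the same expansion once more; a proper log-Euclidean vectorization would additionally scale off-diagonal entries by √2 so that the Euclidean inner product matches the matrix Frobenius inner product, which this sketch omits.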


Human action recognition · Covariance descriptor · Gaussian descriptor · Riemannian manifold · Lie group · Symmetric positive definite matrices · Comparative space transform



Portions of the research in this paper use the DHA video dataset collected by Research Center for Information Technology Innovation (CITI), Academia Sinica.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. CNRS, GREYC, Université de Caen Basse-Normandie, Caen, France
  2. CNRS, ENSAM, LSIS, Aix Marseille Université, Marseille, France
  3. CNRS, LSIS, Université de Toulon, La Garde, France
