
Action recognition in depth videos using hierarchical gaussian descriptor


Abstract

In this paper, we propose a new approach based on distribution descriptors for action recognition in depth videos. Our local features are computed from binary patterns that incorporate shape and motion cues for effective action recognition. Given pixel-level features, our approach estimates local video statistics in a hierarchical manner: the distribution of pixel-level features within each frame and the distribution of the resulting frame-level descriptors are each modeled by a single Gaussian. In this way, our approach constructs video descriptors directly from low-level features, without resorting to the codebook learning required by bag-of-features (BoF) based approaches. To capture the spatial geometry and temporal order of a video, we use a spatio-temporal pyramid representation. Our approach is validated on six benchmark datasets, i.e., MSRAction3D, MSRGesture3D, DHA, SKIG, UTD-MHAD and CAD-120. The experimental results show that our approach performs well on all six datasets; in particular, it achieves state-of-the-art accuracies on the DHA, SKIG and UTD-MHAD datasets.
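For intuition, the following is a minimal sketch, under our own simplifying assumptions, of the two-level estimation described above: one Gaussian summarizes the pixel-level features of each frame, and a second Gaussian summarizes the resulting frame-level descriptors. The mean-plus-covariance flattening used here is an illustrative stand-in for the paper's actual Gaussian embedding, and the spatio-temporal pyramid is omitted.

```python
# A minimal sketch (not the paper's exact pipeline) of hierarchical
# Gaussian estimation: a Gaussian over pixel-level features per frame,
# then a Gaussian over the frame-level descriptors.
import numpy as np

def gaussian_descriptor(features, eps=1e-4):
    """Fit one Gaussian to the rows of `features` (n_samples, d) and
    flatten it into a vector [mean, upper-triangular covariance].
    The flattening is an illustrative stand-in for the paper's embedding."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + eps * np.eye(features.shape[1])
    rows, cols = np.triu_indices(features.shape[1])
    return np.concatenate([mu, cov[rows, cols]])

def video_descriptor(frames):
    """`frames` is a list of (n_pixels, d) arrays of pixel-level features,
    one array per frame; returns a single video-level vector."""
    frame_descs = np.stack([gaussian_descriptor(f) for f in frames])
    return gaussian_descriptor(frame_descs)  # second-level Gaussian over frames

# Toy usage: 10 frames, each with 50 pixel-level features of dimension 4.
rng = np.random.default_rng(0)
video = [rng.normal(size=(50, 4)) for _ in range(10)]
print(video_descriptor(video).shape)  # (119,) = 14 + 14*15/2
```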



Acknowledgments

Portions of the research in this paper use the DHA video dataset collected by Research Center for Information Technology Innovation (CITI), Academia Sinica.

Author information

Corresponding author

Correspondence to Xuan Son Nguyen.

Appendix: Example to Calculate Binary Patterns for Shape and Motion Cues

An example of calculating \(LVP_{shape,\alpha,D}(G_r)\) with \(\alpha = 0^{\circ}\) and \(D = 1\) is given in Fig. 9, where the reference pixel \(G_r\) is marked in red with its depth value. The first of the 8 bits of \(LVP_{shape,\alpha,D}(G_r)\) is calculated from \(I^{\prime}_{\alpha,D}(G_{1,r})\), \(I^{\prime}_{\alpha+45^{\circ},D}(G_{1,r})\), \(I^{\prime}_{\alpha,D}(G_r)\) and \(I^{\prime}_{\alpha+45^{\circ},D}(G_r)\). Since \(I^{\prime}_{\alpha,D}(G_{1,r}) = 7\), \(I^{\prime}_{\alpha+45^{\circ},D}(G_{1,r}) = 1\), \(I^{\prime}_{\alpha,D}(G_r) = -3\), \(I^{\prime}_{\alpha+45^{\circ},D}(G_r) = 1\) and \(1 - \frac{1}{-3} \times 7 > 0\), the first bit of \(LVP_{shape,\alpha,D}(G_r)\) is set to 1. Applying the same calculation to the remaining neighbors yields the binary code \(LVP_{shape,\alpha,D}(G_r) = 11010100\), which is 212 in decimal form.

Fig. 9 (Best viewed in color) Example of calculating \(LVP_{shape,\alpha,D}(G_r)\) with \(\alpha = 0^{\circ}\) and \(D = 1\). The numbers in each image patch represent the depth values of pixels. The reference pixel \(G_r\) is marked in red; the neighboring pixels involved in the calculation are marked in blue. The two directions for calculating derivatives at the related pixels are shown by arrows.
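The per-bit test in this example can be written compactly. The sketch below is our own illustration (the helper name lvp_bit is hypothetical, not from the paper); it reproduces the first-bit computation and the decimal conversion above.

```python
# A sketch of the per-bit LVP comparison used in the example above;
# `lvp_bit` is a hypothetical helper, not code from the paper.

def lvp_bit(d_alpha_ref, d_beta_ref, d_alpha_nbr, d_beta_nbr):
    """One LVP bit from the first-order derivatives along direction
    alpha and direction beta = alpha + 45 degrees, taken at the
    reference pixel G_r (`_ref`) and at one neighbor G_{p,r} (`_nbr`)."""
    # Bit is 1 when the neighbor's beta-derivative exceeds the value
    # predicted by the reference pixel's derivative ratio.
    return 1 if d_beta_nbr - (d_beta_ref / d_alpha_ref) * d_alpha_nbr > 0 else 0

# First bit of the shape example: I'_{a,D}(G_r) = -3, I'_{a+45,D}(G_r) = 1,
# I'_{a,D}(G_{1,r}) = 7, I'_{a+45,D}(G_{1,r}) = 1, so 1 - (1/-3)*7 > 0.
assert lvp_bit(-3, 1, 7, 1) == 1

# Concatenating the 8 bits of the example gives 11010100 = 212.
assert int("11010100", 2) == 212
```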

Another example, calculating \(LVP_{motion,D_1,D_2}(G_r)\) with \(D_1 = -1\) and \(D_2 = 1\), is given in Fig. 10. The first of the 8 bits of \(LVP_{motion,D_1,D_2}(G_r)\) is calculated from \(I^{\prime}_{D_1}(G_{1,r}) = 2\), \(I^{\prime}_{D_2}(G_{1,r}) = 1\), \(I^{\prime}_{D_1}(G_r) = 3\) and \(I^{\prime}_{D_2}(G_r) = -1\). Since \(1 - \frac{-1}{3} \times 2 > 0\), the first bit of \(LVP_{motion,D_1,D_2}(G_r)\) is set to 1. The resulting binary code is \(LVP_{motion,D_1,D_2}(G_r) = 10101101\), which is 173 in decimal form.

Fig. 10 (Best viewed in color) Example of calculating \(LVP_{motion,D_1,D_2}(G_r)\) with \(D_1 = -1\) and \(D_2 = 1\). The numbers in each image patch represent the depth values of pixels. The motion transformations of the reference pixel \(G_r\) are marked in red; those of the neighboring pixels involved in the calculation are marked in blue.
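The motion pattern uses the same test, with temporal derivatives in place of the directional ones; reusing the hypothetical lvp_bit helper from the sketch above:

```python
# First bit of the motion example, reusing the hypothetical `lvp_bit`
# from the previous sketch, with the temporal derivatives
# I'_{D1}(G_r) = 3, I'_{D2}(G_r) = -1, I'_{D1}(G_{1,r}) = 2,
# I'_{D2}(G_{1,r}) = 1, so 1 - (-1/3)*2 > 0.
assert lvp_bit(3, -1, 2, 1) == 1

# Concatenating all 8 bits of the example gives 10101101 = 173.
assert int("10101101", 2) == 173
```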

Cite this article

Nguyen, X.S., Mouaddib, AI., Nguyen, T.P. et al. Action recognition in depth videos using hierarchical gaussian descriptor. Multimed Tools Appl 77, 21617–21652 (2018). https://doi.org/10.1007/s11042-017-5593-x
