
Action recognition in depth videos using hierarchical gaussian descriptor


Abstract

In this paper, we propose a new approach based on distribution descriptors for action recognition in depth videos. Our local features are computed from binary patterns that incorporate shape and motion cues for effective action recognition. Given pixel-level features, our approach estimates local video statistics in a hierarchical manner: the distribution of pixel-level features within each frame and the distribution of the resulting frame-level descriptors are each modeled by a single Gaussian. In this way, our approach constructs video descriptors directly from low-level features, without resorting to the codebook learning required by bag-of-features (BoF) based approaches. To capture the spatial geometry and temporal order of a video, we use a spatio-temporal pyramid representation. Our approach is validated on six benchmark datasets, i.e., MSRAction3D, MSRGesture3D, DHA, SKIG, UTD-MHAD and CAD-120. The experimental results show that our approach performs well on all six datasets; in particular, it achieves state-of-the-art accuracies on the DHA, SKIG and UTD-MHAD datasets.
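For intuition, the following is a minimal sketch, under our own simplifying assumptions, of the two-level estimation described above: one Gaussian summarizes the pixel-level features of each frame, and a second Gaussian summarizes the resulting frame-level descriptors. The mean-plus-covariance flattening used here is an illustrative stand-in for the paper's actual Gaussian embedding, and the spatio-temporal pyramid is omitted.

```python
# A minimal sketch (not the paper's exact pipeline) of hierarchical
# Gaussian estimation: a Gaussian over pixel-level features per frame,
# then a Gaussian over the frame-level descriptors.
import numpy as np

def gaussian_descriptor(features, eps=1e-4):
    """Fit one Gaussian to the rows of `features` (n_samples, d) and
    flatten it into a vector [mean, upper-triangular covariance].
    The flattening is an illustrative stand-in for the paper's embedding."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + eps * np.eye(features.shape[1])
    rows, cols = np.triu_indices(features.shape[1])
    return np.concatenate([mu, cov[rows, cols]])

def video_descriptor(frames):
    """`frames` is a list of (n_pixels, d) arrays of pixel-level features,
    one array per frame; returns a single video-level vector."""
    frame_descs = np.stack([gaussian_descriptor(f) for f in frames])
    return gaussian_descriptor(frame_descs)  # second-level Gaussian over frames

# Toy usage: 10 frames, each with 50 pixel-level features of dimension 4.
rng = np.random.default_rng(0)
video = [rng.normal(size=(50, 4)) for _ in range(10)]
print(video_descriptor(video).shape)  # (119,) = 14 + 14*15/2
```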



Acknowledgments

Portions of the research in this paper use the DHA video dataset collected by Research Center for Information Technology Innovation (CITI), Academia Sinica.

Author information

Corresponding author

Correspondence to Xuan Son Nguyen.

Appendix: Example to Calculate Binary Patterns for Shape and Motion Cues

An example of calculating \(LVP_{shape,\alpha,D}(G_r)\) with \(\alpha = 0^{\circ}\) and \(D = 1\) is given in Fig. 9, where the reference pixel \(G_r\) is marked in red with its depth value. The first of the 8 bits of \(LVP_{shape,\alpha,D}(G_r)\) is calculated from \(I^{\prime}_{\alpha,D}(G_{1,r})\), \(I^{\prime}_{\alpha+45^{\circ},D}(G_{1,r})\), \(I^{\prime}_{\alpha,D}(G_r)\) and \(I^{\prime}_{\alpha+45^{\circ},D}(G_r)\). Since \(I^{\prime}_{\alpha,D}(G_{1,r}) = 7\), \(I^{\prime}_{\alpha+45^{\circ},D}(G_{1,r}) = 1\), \(I^{\prime}_{\alpha,D}(G_r) = -3\), \(I^{\prime}_{\alpha+45^{\circ},D}(G_r) = 1\) and \(1 - \frac{1}{-3} \times 7 > 0\), the first bit of \(LVP_{shape,\alpha,D}(G_r)\) is set to 1. Applying the same calculation to the remaining neighbors yields the binary code \(LVP_{shape,\alpha,D}(G_r) = 11010100\), which is 212 in decimal form.

Fig. 9 (Best viewed in color) Example of calculating \(LVP_{shape,\alpha,D}(G_r)\) with \(\alpha = 0^{\circ}\) and \(D = 1\). The numbers in each image patch represent the depth values of pixels. The reference pixel \(G_r\) is marked in red; the neighboring pixels involved in the calculation are marked in blue. The two directions for calculating derivatives at the related pixels are shown by arrows.
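The per-bit test in this example can be written compactly. The sketch below is our own illustration (the helper name lvp_bit is hypothetical, not from the paper); it reproduces the first-bit computation and the decimal conversion above.

```python
# A sketch of the per-bit LVP comparison used in the example above;
# `lvp_bit` is a hypothetical helper, not code from the paper.

def lvp_bit(d_alpha_ref, d_beta_ref, d_alpha_nbr, d_beta_nbr):
    """One LVP bit from the first-order derivatives along direction
    alpha and direction beta = alpha + 45 degrees, taken at the
    reference pixel G_r (`_ref`) and at one neighbor G_{p,r} (`_nbr`)."""
    # Bit is 1 when the neighbor's beta-derivative exceeds the value
    # predicted by the reference pixel's derivative ratio.
    return 1 if d_beta_nbr - (d_beta_ref / d_alpha_ref) * d_alpha_nbr > 0 else 0

# First bit of the shape example: I'_{a,D}(G_r) = -3, I'_{a+45,D}(G_r) = 1,
# I'_{a,D}(G_{1,r}) = 7, I'_{a+45,D}(G_{1,r}) = 1, so 1 - (1/-3)*7 > 0.
assert lvp_bit(-3, 1, 7, 1) == 1

# Concatenating the 8 bits of the example gives 11010100 = 212.
assert int("11010100", 2) == 212
```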

Another example, calculating \(LVP_{motion,D_1,D_2}(G_r)\) with \(D_1 = -1\) and \(D_2 = 1\), is given in Fig. 10. The first of the 8 bits of \(LVP_{motion,D_1,D_2}(G_r)\) is calculated from \(I^{\prime}_{D_1}(G_{1,r}) = 2\), \(I^{\prime}_{D_2}(G_{1,r}) = 1\), \(I^{\prime}_{D_1}(G_r) = 3\) and \(I^{\prime}_{D_2}(G_r) = -1\). Since \(1 - \frac{-1}{3} \times 2 > 0\), the first bit of \(LVP_{motion,D_1,D_2}(G_r)\) is set to 1. The resulting binary code is \(LVP_{motion,D_1,D_2}(G_r) = 10101101\), which is 173 in decimal form.

Fig. 10 (Best viewed in color) Example of calculating \(LVP_{motion,D_1,D_2}(G_r)\) with \(D_1 = -1\) and \(D_2 = 1\). The numbers in each image patch represent the depth values of pixels. The motion transformations of the reference pixel \(G_r\) are marked in red; those of the neighboring pixels involved in the calculation are marked in blue.
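The motion pattern uses the same test, with temporal derivatives in place of the directional ones; reusing the hypothetical lvp_bit helper from the sketch above:

```python
# First bit of the motion example, reusing the hypothetical `lvp_bit`
# from the previous sketch, with the temporal derivatives
# I'_{D1}(G_r) = 3, I'_{D2}(G_r) = -1, I'_{D1}(G_{1,r}) = 2,
# I'_{D2}(G_{1,r}) = 1, so 1 - (-1/3)*2 > 0.
assert lvp_bit(3, -1, 2, 1) == 1

# Concatenating all 8 bits of the example gives 10101101 = 173.
assert int("10101101", 2) == 173
```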

Cite this article

Nguyen, X.S., Mouaddib, AI., Nguyen, T.P. et al. Action recognition in depth videos using hierarchical gaussian descriptor. Multimed Tools Appl 77, 21617–21652 (2018). https://doi.org/10.1007/s11042-017-5593-x
