
Multimodal human action recognition based on spatio-temporal action representation recognition model

Published in: Multimedia Tools and Applications

Abstract

Human action recognition methods based on single-modal data lack adequate information, which motivates methods based on multimodal data together with fusion algorithms that combine different features. Meanwhile, existing features extracted from depth videos and skeleton sequences are not sufficiently representative. In this paper, we propose a new model, the Spatio-temporal Action Representation Recognition Model, for recognizing human actions. The model introduces a new depth feature map, Hierarchical Pyramid Depth Motion Images (HP-DMI), to represent depth videos, and adopts a Spatial-temporal Graph Convolutional Network (ST-GCN) extractor to summarize skeleton features named Spatio-temporal Joint Descriptors (STJD). The Histogram of Oriented Gradients (HOG) is applied to HP-DMI to extract HP-DMI-HOG features. The two kinds of features are then input into a fusion algorithm, High Trust Mean Canonical Correlation Analysis (HTMCCA), which mitigates the impact of noisy samples on multi-feature fusion and reduces computational complexity. Finally, a Support Vector Machine (SVM) performs the classification. Experiments conducted on two public datasets demonstrate the effectiveness of our approach.
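The pipeline outlined in the abstract — extract one descriptor per modality, project both into a shared correlated subspace, fuse, then classify — can be illustrated with a minimal sketch. Note that this uses *classical* CCA on synthetic stand-in features, not the paper's HTMCCA, and the dimensions and feature names below are illustrative assumptions rather than the actual HP-DMI-HOG/STJD extractors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two modality descriptors in the paper:
# X plays the role of HP-DMI-HOG (depth), Y of STJD (skeleton).
n, dx, dy, k = 120, 64, 32, 10            # samples, feature dims, CCA dims
labels = rng.integers(0, 4, size=n)       # 4 hypothetical action classes
X = rng.normal(size=(n, dx)) + labels[:, None]   # crude shared class signal
Y = rng.normal(size=(n, dy)) + labels[:, None]

def cca_projections(X, Y, k, eps=1e-6):
    """Classical CCA: whiten each view, then SVD the cross-covariance."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Cxx = Xc.T @ Xc / (n - 1) + eps * np.eye(dx)   # regularized covariances
    Cyy = Yc.T @ Yc / (n - 1) + eps * np.eye(dy)
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Inverse square root via eigendecomposition (C is symmetric PD)
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return Wx @ U[:, :k], Wy @ Vt.T[:, :k]         # projection matrices

Px, Py = cca_projections(X, Y, k)
# Serial fusion: concatenate the two projected views — a common CCA
# fusion rule; the fused vector would then feed an SVM as in the paper.
Z = np.hstack([(X - X.mean(0)) @ Px, (Y - Y.mean(0)) @ Py])
print(Z.shape)  # (120, 20)
```

The fused representation `Z` would then be passed to an SVM classifier; HTMCCA differs from this sketch mainly in how it weights samples to reduce the influence of noisy ones before computing the correlated subspace.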


Data Availability

The authors declare that data supporting the findings of this study are available within the article.


Author information


Corresponding author

Correspondence to Qian Huang.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflicts of interest to report regarding the present study.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, Q., Huang, Q. & Li, X. Multimodal human action recognition based on spatio-temporal action representation recognition model. Multimed Tools Appl 82, 16409–16430 (2023). https://doi.org/10.1007/s11042-022-14193-0

