Multi-view key information representation and multi-modal fusion for single-subject routine action recognition

Published in Applied Intelligence

Abstract

Human action recognition has witnessed remarkable progress, and its achievements have been applied to daily life. However, most methods extract features from only a single view within each modality, which may not comprehensively capture the diversity and complexity of actions. Moreover, ineffective removal of redundant information can leave the key information inconspicuous. These issues could affect the final action recognition accuracy. To address them, this paper proposes a novel method for single-subject routine action recognition that combines multi-view key information representation and multi-modal fusion. First, the energy of non-primary motion areas in the depth video sequence is reduced by motion mean normalization, thereby enhancing the key information of the action. Then, a depth motion history map (DMHM) and a depth spatio-temporal energy map (DSTEM) are extracted from the projection planes and axes, respectively. The proposed DMHM effectively preserves the spatio-temporal information of actions, while the DSTEM preserves the motion contour and energy information. For skeleton sequences, statistical features and the motion contribution degree of each joint are extracted from the views of motion distribution and joint weights, respectively. Finally, the depth and skeleton features are fused to achieve multi-modal fusion-based action recognition. The proposed method highlights the information of the main motion areas and achieves recognition accuracies of 96.70% on MSR-Action3D, 93.26% on UTD-MHAD, and above 97.73% on all tests of CZU-MHAD. The experimental results demonstrate that the proposed method effectively preserves action information and achieves better recognition accuracy than most existing methods.
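To make the depth-side pipeline concrete, here is a minimal NumPy sketch of the two ideas named above: a motion-mean normalization that suppresses low-motion (non-primary) regions, followed by a motion-history-style map in the spirit of the DMHM. The threshold `xi`, decay length `tau`, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def motion_mean_normalize(frames):
    """Suppress non-primary motion areas in a depth sequence (T, H, W).

    Removing each pixel's temporal mean drives static background toward
    zero, so later motion maps are dominated by the moving subject.
    """
    frames = frames.astype(np.float32)
    return np.abs(frames - frames.mean(axis=0, keepdims=True))

def depth_motion_history_map(frames, tau=20.0, xi=8.0):
    """Motion-history-style map over a (normalized) depth sequence.

    Pixels that just moved are set to `tau`; older motion decays by one
    per frame, so the final map encodes where and roughly when motion
    occurred -- the spatio-temporal cue the DMHM is designed to keep.
    """
    hist = np.zeros(frames.shape[1:], dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        moving = np.abs(curr - prev) > xi        # per-pixel motion mask
        hist = np.where(moving, tau, np.maximum(hist - 1.0, 0.0))
    return hist / tau                            # normalize to [0, 1]
```

Projecting the depth frames onto the orthogonal planes before calling `depth_motion_history_map` would yield one such map per view, analogous to the multi-view construction described above.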
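On the skeleton side, the "motion contribution degree of each joint" can be read as each joint's share of the total motion energy. The sketch below computes such weights from frame-to-frame joint displacements; the abstract does not give the exact definition, so this normalization is an assumption.

```python
import numpy as np

def joint_contribution_degrees(skeleton):
    """Per-joint motion contribution weights for a sequence (T, J, 3).

    A joint's contribution is its summed frame-to-frame displacement,
    normalized so the J weights sum to one; highly active joints
    (e.g., the hands in a waving action) receive the largest weights.
    """
    disp = np.linalg.norm(np.diff(skeleton, axis=0), axis=2)  # (T-1, J)
    energy = disp.sum(axis=0)                                 # (J,)
    return energy / max(energy.sum(), 1e-8)
```

Such weights can then re-scale the per-joint statistical features so that joints driving the action dominate the skeleton descriptor.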
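Finally, the abstract does not specify the fusion operator, so the following is only a generic feature-level fusion sketch, not the paper's method: per-modality standardization, concatenation, and an off-the-shelf classifier (the SVC choice is illustrative).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fuse_and_classify(depth_feats, skel_feats, labels):
    """Feature-level fusion of two modalities, then classification.

    depth_feats: (N, Dd) descriptors from the DMHM/DSTEM-style maps.
    skel_feats:  (N, Ds) statistical and contribution-weighted features.
    """
    fused = np.hstack([StandardScaler().fit_transform(depth_feats),
                       StandardScaler().fit_transform(skel_feats)])
    clf = SVC(kernel="rbf")  # illustrative classifier choice
    return clf.fit(fused, labels)
```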


Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 41971343, and the Jiangsu Graduate Student Research Innovation Program under Grant No. KYCX23_1693.

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization: Xin Chao, Genlin Ji, and Xiaosha Qi. Methodology: Xin Chao and Genlin Ji. Validation: Xin Chao. Writing, original draft preparation: Xin Chao. Funding acquisition: Xin Chao and Genlin Ji. All authors have read and agreed to this version of the paper.

Corresponding author

Correspondence to Genlin Ji.

Ethics declarations

Competing Interests

The authors declare that they have no conflict of interest related to this work.

Ethical and informed consent for data used

The data used in this paper are all from published public datasets.

Data availability and access

This paper has no associated data.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chao, X., Ji, G. & Qi, X. Multi-view key information representation and multi-modal fusion for single-subject routine action recognition. Appl Intell 54, 3222–3244 (2024). https://doi.org/10.1007/s10489-024-05319-y

