Skeleton-based human action recognition with sequential convolutional-LSTM networks and fusion strategies

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Human action recognition from skeleton data has attracted considerable research attention, owing to the availability of thousands of real videos that pose many challenges. Existing works have attempted to model the spatial characteristics and temporal dependencies of 3D joints using dynamic time warping, hand-crafted features, and spatial co-occurrence features. However, the representation derived from the spatial stream overemphasizes the temporal information; thus, it yields limited expressive power. Some studies treat skeleton sequences as frames to enhance the expressive power of the representation, but they lose generalization capability because the derived temporal smoothness is specific to a particular dataset. The proposed work uses joint distance maps as a base representation that encodes spatial and temporal information as color texture images. We increase the expressive power by extracting feature maps from networks pre-trained on ImageNet to diversify the texture representation, and we propose a network architecture that models the temporal dependency explicitly. We also explore various fusion strategies to generate diverse representations from the feature maps of the pre-trained networks. The experimental results show that the proposed method achieves its best recognition accuracy when using decision-level fusion with a meta-learner (Random Forest). The analysis also reveals that feature-level fusion offers a relatively good trade-off: its recognition performance is on par with some decision-level fusion strategies while it has fewer tunable parameters. Extensive experimental results and comparative analysis on three benchmark datasets show that the proposed representation and network not only yield better recognition accuracy but also exhibit stronger generalization capability across multiple datasets.
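As a rough illustration of the base representation named in the abstract, the following sketch builds a joint distance map from a skeleton sequence and renders it as a color image. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `joint_distance_map` and `to_color_image`, the layout (one row per joint pair, one column per frame), the use of Euclidean distance, and the simple blue-to-red color ramp are all hypothetical choices made for this example.

```python
import numpy as np


def joint_distance_map(skeleton):
    """Pairwise joint distances over time.

    skeleton: array of shape (T, J, 3) holding 3D joint coordinates
              for T frames and J joints.
    Returns an array of shape (P, T) with P = J * (J - 1) / 2, where each
    row tracks one joint pair and each column is one frame (assumed layout).
    """
    skeleton = np.asarray(skeleton, dtype=float)
    num_frames, num_joints, _ = skeleton.shape
    # Index every unordered joint pair exactly once.
    first, second = np.triu_indices(num_joints, k=1)
    diff = skeleton[:, first, :] - skeleton[:, second, :]  # (T, P, 3)
    dist = np.linalg.norm(diff, axis=-1)                   # (T, P)
    return dist.T                                          # (P, T)


def to_color_image(jdm):
    """Map a distance map to an RGB texture image.

    Uses a hand-rolled blue-to-red ramp (an assumption for illustration;
    the paper's actual color encoding is not given in the abstract).
    """
    lo, hi = jdm.min(), jdm.max()
    norm = (jdm - lo) / (hi - lo) if hi > lo else np.zeros_like(jdm)
    red = norm
    green = 1.0 - np.abs(2.0 * norm - 1.0)
    blue = 1.0 - norm
    rgb = np.stack([red, green, blue], axis=-1)
    return (rgb * 255).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sequence = rng.normal(size=(40, 20, 3))  # 40 frames, 20 joints (synthetic)
    image = to_color_image(joint_distance_map(sequence))
    print(image.shape)
```

An image produced this way could then be resized and passed to ImageNet-pre-trained networks for feature extraction, which is the step the abstract describes next; that part is omitted here because the abstract does not specify the networks or the layers used.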

Availability of data and materials

The datasets analyzed in this study are included in the published articles (Wang et al. 2012; Seidenari et al. 2013; Shahroudy et al. 2016a, b).

Acknowledgements

This work was supported by the Hankuk University of Foreign Studies Research Fund and by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2021R1F1A1060244).

Author information

Corresponding author

Correspondence to Seok-Lyong Lee.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khowaja, S.A., Lee, SL. Skeleton-based human action recognition with sequential convolutional-LSTM networks and fusion strategies. J Ambient Intell Human Comput 13, 3729–3746 (2022). https://doi.org/10.1007/s12652-022-03848-3

  • DOI: https://doi.org/10.1007/s12652-022-03848-3
