Skip to main content

Skeleton-Based Human Action Recognition on Large-Scale Datasets

Part of the Intelligent Systems Reference Library book series (ISRL,volume 207)

Abstract

Skeleton-based Human Action Recognition (SHAR) is one of the most trending research topics in computer vision, which relies on the investigation of multi-modal data acquired from different sensory devices. Due to its faster execution speed, skeleton-based automated SHAR systems are widely adopted in real-time applications such as surveillance systems, behavior analysis, gesture recognition, and security systems. To build such an efficient system, the recognition model needs to be trained on a large-scale multi-modal dataset to accurately identify multi-class actions. However, the recognition of multi-class actions with higher accuracy requires the extraction of the spatio-temporal discriminative features, which is a challenging task. Moreover, the traditional handcrafted feature-based Machine Learning (ML) models when dealing with a large dataset, fail to preserve spatio-temporal correlations, resulting in poor recognition accuracy. While exploring the recent literature, a paradigm shift in SHAR research from traditional ML-based models to deep learning models is observed, which intrigues us to hypothesize that deep learning is the future avenue for SHAR research on large datasets. To establish it, we perform an exhaustive study on the recent works published in 2017 onwards that utilize deep learning models in SHAR on large skeletal datasets, notably alleviating the tedious task of feature engineering and extraction. The comparative performance analysis of the recent models shows that the successor models outperform its predecessors in terms of recognition accuracy, confirming the prospects of the deep learning-based SHAR research on large skeletal datasets.

This is a preview of subscription content, access via your institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-030-75490-7_5
  • Chapter length: 22 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
eBook
USD   169.00
Price excludes VAT (USA)
  • ISBN: 978-3-030-75490-7
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   219.99
Price excludes VAT (USA)
Hardcover Book
USD   219.99
Price excludes VAT (USA)
Fig. 5.1
Fig. 5.2

References

  1. Mokari, M., Mohammadzade, H., Ghojogh, B.: Recognizing involuntary actions from 3D skeleton data using body states. arXiv preprint arXiv:1708.06227 (2017)

  2. Weng, J., Weng, C., Yuan, J.: Spatio-temporal Naive-Bayes nearest-neighbor (ST-NBNN) for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4171–4180 (2017)

    Google Scholar 

  3. Asadi-Aghbolaghi, M., Bertiche, H., Roig, V., Kasaei, S., Escalera, S.: Action recognition from RGB-D data: comparison and fusion of spatio-temporal handcrafted features and deep strategies. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 3179–3188 (2017)

    Google Scholar 

  4. Dang, L.M., Min, K., Wang, H., Piran, M.J., Lee, C.H., Moon, H.: Sensor-based and vision-based human activity recognition: a comprehensive survey. Pattern Recogn. 108, 107561 (2020)

    CrossRef  Google Scholar 

  5. Jegham, I., Khalifa, A.B., Alouani, I., Mahjoub, M.A.: Vision-based human action recognition: an overview and real world challenges. Forensic Sci. Int.: Digit. Invest. 32, 200901 (2020)

    Google Scholar 

  6. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)

    Google Scholar 

  7. Gaglio, S., Re, G.L., Morana, M.: Human activity recognition process using 3-D posture data. IEEE Trans. Hum.-Mach. Syst. 45(5), 586–597 (2014)

    CrossRef  Google Scholar 

  8. Xia, L., Chen, C.C., Aggarwal, J.K.: View invariant human action recognition using histograms of 3D joints. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 20–27. IEEE (2012)

    Google Scholar 

  9. Seidenari, L., Varano, V., Berretti, S., Bimbo, A., Pala, P.: Recognizing actions from depth cameras as weakly aligned multi-part bag-of-poses. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 479–485 (2013)

    Google Scholar 

  10. Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: a large scale dataset for 3D human activity analysis. In: IEEE Conference on Computer Vision and Pattern Recognition, June 2016

    Google Scholar 

  11. Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.-Y., Kot, A.C.: NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 42, 2684–2701 (2019)

    CrossRef  Google Scholar 

  12. Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)

    Google Scholar 

  13. Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., Sheikh, Y.: OpenPose: realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008 (2018)

  14. Li, B., Li, X., Zhang, Z., Fei, W.: Spatio-temporal graph routing for skeleton-based action recognition. Proc. AAAI Conf. Artif. Intell. 33, 8561–8568 (2019)

    Google Scholar 

  15. Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3595–3603 (2019)

    Google Scholar 

  16. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12026–12035 (2019)

    Google Scholar 

  17. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7912–7921 (2019)

    Google Scholar 

  18. Zhang, X., Xu, C., Tao, D.: Context aware graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14333–14342 (2020)

    Google Scholar 

  19. Cho, S., Maqbool, M., Liu, F., Foroosh, H.: Self-attention network for skeleton-based human action recognition. In: The IEEE Winter Conference on Applications of Computer Vision, pp. 635–644 (2020)

    Google Scholar 

  20. Liu, Z., Zhang, H., Chen, Z., Wang, Z., Ouyang, W.: Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 143–152 (2020)

    Google Scholar 

  21. Yang, W., Zhang, J., Cai, J., Zhiyong, X.: Shallow graph convolutional network for skeleton-based action recognition. Sensors 21(2), 452 (2021)

    CrossRef  Google Scholar 

  22. Xie, J., et al.: Cross-channel graph convolutional networks for skeleton-based action recognition. IEEE Access 9, 9055–9065 (2021)

    CrossRef  Google Scholar 

  23. Ahmad, T., Jin, L., Lin, L., Tang, G.Z.: Skeleton-based action recognition using sparse spatio-temporal GCN with edge effective resistance. Neurocomputing 423, 389–398 (2021)

    CrossRef  Google Scholar 

  24. Xia, H., Gao, X.: Multi-scale mixed dense graph convolution network for skeleton-based action recognition. IEEE Access 9, 36475–36484 (2021)

    CrossRef  Google Scholar 

  25. Cai, J., Jiang, N., Han, X., Jia, K., Lu, J.: JOLO-GCN: mining joint-centered light-weight information for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2735–2744 (2021)

    Google Scholar 

  26. Ji, Y., Xu, F., Yang, Y., Shen, F., Shen, H.T., Zheng, W.-S.: A large-scale varying-view RGB-D action dataset for arbitrary-view human action recognition. arXiv preprint arXiv:1904.10681 (2019)

  27. Capecci, M., et al.: The KIMORE dataset: kinematic assessment of movement and clinical scores for remote monitoring of physical rehabilitation. IEEE Trans. Neural Syst. Rehabil. Eng. 27(7), 1436–1448 (2019)

    CrossRef  Google Scholar 

  28. Rahmani, H., Mahmood, A., Huynh, D., Mian, A.: Histogram of oriented principal components for cross-view action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 38(12), 2430–2443 (2016)

    CrossRef  Google Scholar 

  29. Liu, C., Hu, Y., Li, Y., Song, S., Liu, J.: PKU-MMD: a large scale benchmark for continuous multi-modal human action understanding. arXiv preprint arXiv:1703.07475 (2017)

  30. Kong, Q., Wu, Z., Deng, Z., Klinkigt, M., Tong, B., Murakami, T.: MMAct: a large-scale dataset for cross modal human action understanding. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 8658–8667 (2019)

    Google Scholar 

  31. Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pp. 9–14. IEEE (2010)

    Google Scholar 

  32. Wang, J., Liu, Z., Wu, Y., Yuan, J.: Mining actionlet ensemble for action recognition with depth cameras. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1290–1297. IEEE (2012)

    Google Scholar 

  33. Chen, C., Jafari, R., Kehtarnavaz, N.: UTD-MHAD: a multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In: 2015 IEEE International conference on image processing (ICIP), pp. 168–172. IEEE (2015)

    Google Scholar 

  34. Wang, J., Nie, X., Xia, Y., Wu, Y., Zhu, S.-C.: Cross-view action modeling, learning and recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2649–2656 (2014)

    Google Scholar 

  35. Zhu, W., et al.: Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)

    Google Scholar 

  36. Li, S., Jiang, T., Huang, T., Tian, Y.: Global co-occurrence feature learning and active coordinate system conversion for skeleton-based action recognition. In: The IEEE Winter Conference on Applications of Computer Vision, pp. 586–594 (2020)

    Google Scholar 

  37. Dean, J., et al.: Large scale distributed deep networks. In: Advances in Neural Information Processing Systems, pp. 1223–1231 (2012)

    Google Scholar 

  38. Zhao, R., Wang, K., Su, H., Ji, Q.: Bayesian graph convolution LSTM for skeleton based action recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 6882–6892 (2019)

    Google Scholar 

  39. Si, C., Chen, W., Wang, W., Wang, L., Tan, T.: An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1227–1236 (2019)

    Google Scholar 

  40. Si, C., Jing, Y., Wang, W., Wang, L., Tan, T.: Skeleton-based action recognition with spatial reasoning and temporal stack learning. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 103–118 (2018)

    Google Scholar 

  41. Liu, J., Shahroudy, A., Xu, D., Kot, A.C., Wang, G.: Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Trans. Pattern Anal. Mach. Intell. 40(12), 3007–3021 (2017)

    CrossRef  Google Scholar 

  42. Huang, J., Xiang, X., Gong, X., Zhang, B., et al.: Long-short graph memory network for skeleton-based action recognition. In: The IEEE Winter Conference on Applications of Computer Vision, pp. 645–652 (2020)

    Google Scholar 

  43. Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2117–2126 (2017)

    Google Scholar 

  44. Lee, I., Kim, D., Kang, S., Lee, S.: Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1012–1020 (2017)

    Google Scholar 

  45. Ke, Q., Bennamoun, M., An, S., Sohel, F., Boussaid, F.: Learning clip representations for skeleton-based 3D action recognition. IEEE Trans. Image Process. 27(6), 2842–2855 (2018)

    MathSciNet  CrossRef  Google Scholar 

  46. Luvizon, D., Picard, D., Tabia, H.: Multi-task deep learning for real-time 3D human pose estimation and action recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2020)

    Google Scholar 

  47. Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, pp. 3686–3693 (2014)

    Google Scholar 

  48. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1325–1339 (2013)

    Google Scholar 

  49. Morais, R., Le, V., Tran, T., Saha, B., Mansour, M., Venkatesh, S.: Learning regularity in skeleton trajectories for anomaly detection in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11996–12004 (2019)

    Google Scholar 

  50. Tang, Y., Tian, Y., Lu, J., Li, P., Zhou, J.: Deep progressive reinforcement learning for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5323–5332 (2018)

    Google Scholar 

  51. Baek, S., Kim, K.I., Kim, T.-K.: Augmented skeleton space transfer for depth-based hand pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8330–8339 (2018)

    Google Scholar 

  52. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)

    Google Scholar 

  53. Gammulle, H., Denman, S., Sridharan, S., Fookes, C.: Fine-grained action segmentation using the semi-supervised action GAN. Pattern Recogn. 98, 107039 (2020)

    CrossRef  Google Scholar 

  54. Lv, F., Nevatia, R.: Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost. In: European Conference on Computer Vision, pp. 359–372. Springer (2006)

    Google Scholar 

  55. Wu, D., Shao, L.: Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 724–731 (2014)

    Google Scholar 

  56. Rahmani, H., Bennamoun, M.: Learning action recognition model from depth and skeleton videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5832–5841 (2017)

    Google Scholar 

  57. Ke, Q., Bennamoun, M., An, S., Sohel, F., Boussaid, F.: A new representation of skeleton sequences for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3288–3297 (2017)

    Google Scholar 

  58. Zhang, P., Lan, C., Xing, J., Zeng, W., Xue, J., Zheng, N.: View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 1963–1978 (2019)

    CrossRef  Google Scholar 

  59. Nie, Q., Wang, J., Wang, X., Liu, Y.: View-invariant human action recognition based on a 3D bio-constrained skeleton model. IEEE Trans. Image Process. 28(8), 3959–3972 (2019)

    MathSciNet  CrossRef  Google Scholar 

  60. Su, K., Liu, X., Shlizerman, E.: Predict & cluster: unsupervised skeleton based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9631–9640 (2020)

    Google Scholar 

  61. Tian, D., Lu, Z.-M., Chen, X., Ma, L.-H.: An attentional spatial temporal graph convolutional network with co-occurrence feature learning for action recognition. Multimed. Tools Appl. 2020, 1–19 (2020)

    Google Scholar 

  62. Cheng, K., Zhang, Y., He, X., Chen, W., Cheng, J., Lu, H.: Skeleton-based action recognition with shift graph convolutional network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 183–192 (2020)

    Google Scholar 

  63. Main de Boissiere, A., Noumeir, R.: Infrared and 3D skeleton feature fusion for RGB-D action recognition. arXiv preprint arXiv:2002.12886 (2020)

  64. Dong, J., et al.: Action recognition based on the fusion of graph convolutional networks with high order features. Appl. Sci. 10(4), 1482 (2020)

    CrossRef  Google Scholar 

  65. Wang, H., Baosheng, Yu., Xia, K., Li, J., Zuo, X.: Skeleton edge motion networks for human action recognition. Neurocomputing 423, 1–12 (2021)

    CrossRef  Google Scholar 

  66. Liu, J., Wang, G., Hu, P., Duan, L.-Y., Kot, A.C.: Global context-aware attention LSTM networks for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1647–1656 (2017)

    Google Scholar 

  67. Liu, M., Liu, H., Chen, C.: Enhanced skeleton visualization for view invariant human action recognition. Pattern Recogn. 68, 346–362 (2017)

    CrossRef  Google Scholar 

  68. Jian-Fang, H., Zheng, W.-S., Ma, L., Wang, G., Lai, J., Zhang, J.: Early action prediction by soft regression. IEEE Trans. Pattern Anal. Mach. Intell. 41(11), 2568–2583 (2018)

    Google Scholar 

  69. Liu, J., Shahroudy, A., Wang, G., Duan, L.-Y., Kot, A.C.: Skeleton-based online action prediction using scale selection network. IEEE Trans. Pattern Anal. Mach. Intell. 42(6), 1453–1467 (2019)

    CrossRef  Google Scholar 

  70. Papadopoulos, K., Ghorbel, E., Aouada, D., Ottersten, B.: Vertex feature encoding and hierarchical temporal modeling in a spatial-temporal graph convolutional network for action recognition. arXiv preprint arXiv:1912.09745 (2019)

  71. Huynh-The, T., Hua, C.-H., Tu, N.A., Kim, D.-S.: Learning geometric features with dual–stream CNN for 3D action recognition. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2353–2357. IEEE (2020)

    Google Scholar 

  72. Kong, Y., Fu, Y.: Bilinear heterogeneous information machine for RGB-D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1054–1062 (2015)

    Google Scholar 

  73. Li, X., Zhang, Y., Zhang, J.: Improved key poses model for skeleton-based action recognition. In: Pacific Rim Conference on Multimedia, pp. 358–367. Springer (2017)

    Google Scholar 

  74. Zhang, X., Xu, C., Tian, X., Tao, D.: Graph edge convolutional neural networks for skeleton-based action recognition. IEEE Trans. Neural Netw. Learn. Syst. 31(8), 3047–3060 (2019)

    CrossRef  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Md Atiqur Rahman Ahad .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter

Verify currency and authenticity via CrossMark

Cite this chapter

Hossain, T., Sarker, S., Rahman, S., Ahad, M.A.R. (2021). Skeleton-Based Human Action Recognition on Large-Scale Datasets. In: Ahad, M.A.R., Inoue, A. (eds) Vision, Sensing and Analytics: Integrative Approaches. Intelligent Systems Reference Library, vol 207. Springer, Cham. https://doi.org/10.1007/978-3-030-75490-7_5

Download citation