RGB+2D skeleton: local hand-crafted and 3D convolution feature coding for action recognition

  • Original Paper
  • Published:
Signal, Image and Video Processing

Abstract

Most human action recognition methods based on 3D skeleton features are sensitive to changes in viewpoint, motion scale, and human scale. In addition, acquiring depth information from real outdoor scenes suffers from poor precision or high computational cost. To address these drawbacks, we propose a new action recognition method based on RGB video and 2D skeletons that combines a local joint trajectory volume representation with feature coding. First, a video is converted into a set of volumes, called local joint trajectory volumes. Then, hand-crafted descriptors and 3D convolutional networks are used to compute the features of each volume-based RGB image sequence. Unlike most works, which use convolutional networks to learn global video features, this paper discusses the problem of using a convolutional network to represent local video regions. Finally, the feature set of each joint is encoded into a Fisher vector as the action feature, and the classifier is trained with a linear SVM. Experimental results show that skeleton-joint-based features yield a more compact and effective action representation than competing approaches.
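As a rough sketch of the final two stages described in the abstract — encoding a set of per-joint local features into a Fisher vector and training a linear SVM — the snippet below uses the standard Fisher vector formulation (gradients with respect to the means and variances of a diagonal-covariance GMM, with power and L2 normalization). The descriptors here are randomly generated stand-ins; the paper's actual hand-crafted and 3D convolutional volume features are not reproduced, and all names in the snippet are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def fisher_vector(descriptors, gmm):
    """Encode local descriptors (T x D) as a Fisher vector: gradients of the
    GMM log-likelihood w.r.t. means and variances, power- and L2-normalized."""
    T, D = descriptors.shape
    gamma = gmm.predict_proba(descriptors)            # (T, K) soft assignments
    mu = gmm.means_                                   # (K, D)
    sigma = np.sqrt(gmm.covariances_)                 # (K, D), diagonal covariance
    w = gmm.weights_                                  # (K,)

    parts = []
    for k in range(gmm.n_components):
        diff = (descriptors - mu[k]) / sigma[k]       # whitened residuals (T, D)
        g = gamma[:, k:k + 1]
        grad_mu = (g * diff).sum(axis=0) / (T * np.sqrt(w[k]))
        grad_sig = (g * (diff ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * w[k]))
        parts.extend([grad_mu, grad_sig])
    fv = np.concatenate(parts)                        # length 2 * K * D

    fv = np.sign(fv) * np.sqrt(np.abs(fv))            # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)          # L2 normalization

# Toy usage: random stand-in "local volume" descriptors, one set per video.
rng = np.random.default_rng(0)
train_desc = [rng.normal(size=(30, 16)) + y for y in (0, 1) for _ in range(5)]
labels = [y for y in (0, 1) for _ in range(5)]

gmm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(np.vstack(train_desc))
X = np.array([fisher_vector(d, gmm) for d in train_desc])  # (10, 128)
clf = LinearSVC().fit(X, labels)                           # linear SVM classifier
```

In practice one such Fisher vector would be computed per joint from that joint's trajectory-volume features, and the per-joint vectors concatenated before classification; the exact pooling across joints follows the paper, not this sketch.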



Author information

Corresponding author

Correspondence to Hong-Bo Zhang.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the Natural Science Foundation of China (Nos. 61871196, 61673186, 61972167, and 62001176), the National Key Research and Development Program of China (No. 2019YFC1604700), the Natural Science Foundation of Fujian Province of China (Nos. 2019J01082 and 2020J01085), and the Promotion Program for Young and Middle-aged Teachers in Science and Technology Research of Huaqiao University (ZQN-YX601 and ZQN-710).


About this article


Cite this article

Zhang, YX., Zhang, HB., Du, JX. et al. RGB+2D skeleton: local hand-crafted and 3D convolution feature coding for action recognition. SIViP 15, 1379–1386 (2021). https://doi.org/10.1007/s11760-021-01868-8

