Abstract
Most 3D skeleton feature-based human action recognition methods are sensitive to changes in viewpoint, motion scale, and human scale. In addition, acquiring depth information from real outdoor scenes yields poor precision or incurs high computational cost. To address these drawbacks, in this study we propose a new action recognition method based on RGB video and 2D skeletons, comprising a local joint trajectory volume representation and feature coding. First, a video is converted into a set of volumes, called local joint trajectory volumes. Then, hand-crafted descriptors and convolutional networks are used to compute the features of each volume-based RGB image sequence. Unlike most works, which use convolutional networks to learn global video features, this paper discusses using a convolutional network to represent local video regions. Finally, the feature set of each joint is encoded into a Fisher vector as the action feature, and the classifier is trained with a linear SVM. The experimental results show that skeleton-joint-based features yield a more compact and effective action representation than competing approaches.
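The final coding stage described above — pooling a joint's local features into a Fisher vector and classifying with a linear SVM — can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: a diagonal-covariance GMM stands in for the learned codebook, and random arrays stand in for the per-joint volume features; the function `fisher_vector` and all dimensions are hypothetical choices for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode a set of local descriptors as a Fisher vector:
    gradients of the log-likelihood of a diagonal-covariance GMM
    with respect to its means and standard deviations."""
    X = np.atleast_2d(descriptors)
    N, D = X.shape
    q = gmm.predict_proba(X)            # (N, K) soft assignments
    pi = gmm.weights_                   # (K,) mixture weights
    mu = gmm.means_                     # (K, D) component means
    sigma = np.sqrt(gmm.covariances_)   # (K, D) diagonal std devs
    parts = []
    for k in range(len(pi)):
        diff = (X - mu[k]) / sigma[k]   # whitened offsets, (N, D)
        g_mu = (q[:, k, None] * diff).sum(0) / (N * np.sqrt(pi[k]))
        g_sig = (q[:, k, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * pi[k]))
        parts.extend([g_mu, g_sig])
    fv = np.concatenate(parts)
    # power- and L2-normalisation, standard for Fisher vectors
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

# Toy usage: K = 2 GMM components over D = 8 dimensional local features,
# so the Fisher vector has 2 * K * D = 32 dimensions.
rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(rng.normal(size=(200, 8)))
fv = fisher_vector(rng.normal(size=(30, 8)), gmm)
print(fv.shape)
```

The resulting fixed-length vector (one per joint, concatenated over joints in the paper's setting) can be fed directly to `sklearn.svm.LinearSVC`, matching the linear-SVM classifier the method uses.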
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work was supported by the Natural Science Foundation of China (Nos. 61871196, 61673186, 61972167 and 62001176), the National Key Research and Development Program of China (No. 2019YFC1604700), the Natural Science Foundation of Fujian Province of China (Nos. 2019J01082 and 2020J01085), and the Promotion Program for Young and Middle-aged Teachers in Science and Technology Research of Huaqiao University (ZQN-YX601, ZQN-710).
Cite this article
Zhang, YX., Zhang, HB., Du, JX. et al. RGB+2D skeleton: local hand-crafted and 3D convolution feature coding for action recognition. SIViP 15, 1379–1386 (2021). https://doi.org/10.1007/s11760-021-01868-8