
Quo Vadis, Skeleton Action Recognition?

Abstract

In this paper, we study current and upcoming frontiers across the landscape of skeleton-based human action recognition. To study skeleton action recognition in the wild, we introduce Skeletics-152, a curated and 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset. We extend our study to include out-of-context actions by introducing Skeleton-Mimetics, a dataset derived from the recently introduced Mimetics dataset. We also introduce Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and of interpretative dance performances. We benchmark state-of-the-art models on the NTU-120 dataset and provide a multi-layered assessment of the results. Benchmarking the top performers of NTU-120 on the newly introduced datasets reveals the challenges and domain gap induced by actions in the wild. Overall, our work characterizes the strengths and limitations of existing approaches and datasets. Via the introduced datasets, our work enables new frontiers for human action recognition.




Acknowledgements

We wish to thank the anonymous reviewers for their detailed and constructive feedback. We also wish to thank Kalyan Adithya and Sai Shashank Kalakonda for their efforts in creating the project page. This work is partly supported by MeitY, Government of India.

Author information


Corresponding author

Correspondence to Ravi Kiran Sarvadevabhatla.


Communicated by Manuel J. Marin-Jimenez.


Cite this article

Gupta, P., Thatipelli, A., Aggarwal, A. et al. Quo Vadis, Skeleton Action Recognition? International Journal of Computer Vision 129, 2097–2112 (2021). https://doi.org/10.1007/s11263-021-01470-y


Keywords

  • Human action recognition
  • Human activity recognition
  • Skeleton
  • 3-D human pose
  • Deep learning