Understanding the limits of 2D skeletons for action recognition

Abstract

Advances in motion capture technologies have made 3D action recognition a popular task with wide applicability in areas such as augmented reality, human–computer interaction, sports, and healthcare. However, acquiring 3D human skeleton data is expensive and time-consuming, mainly due to the high cost of capturing technologies and the lack of suitable actors. We overcome these issues by focusing on the 2D skeleton modality, which can be easily extracted from ordinary videos. The objective of this work is to demonstrate the high descriptive power of the 2D skeleton modality by achieving accuracy on daily action recognition that is competitive with 3D skeleton data. More importantly, we thoroughly analyze the factors that significantly influence 2D recognition accuracy: sensitivity to data normalization, scaling, and quantization, as well as 3D-to-2D distortions in skeleton orientations and sizes caused by the loss of the depth dimension and a fixed-angle camera view. We also provide insights on how to mitigate these problems and thereby significantly increase recognition accuracy. The experimental evaluation is conducted on three datasets that differ in nature. We also report which types of actions are learned better from 2D skeletons and which from 3D skeletons. Throughout the experiments, we use a generic lightweight LSTM network whose architecture can be easily tuned to achieve the desired trade-off between accuracy and efficiency. We show that the proposed approach not only achieves state-of-the-art results in 2D skeleton action recognition but is also highly competitive with the best-performing methods that classify 3D skeleton sequences or visual content extracted from ordinary videos.
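To make the factors named in the abstract concrete, the following is a minimal illustrative sketch, not the authors' pipeline. It assumes an orthographic projection (the depth coordinate is simply dropped), root-centered normalization by skeleton height, and uniform quantization; the function names, the toy five-joint skeleton, and the 60-degree rotation are all hypothetical choices made for illustration. It shows how a fixed camera view shrinks the apparent arm span of a rotated subject.

```python
import math

def rotate_y(joints3d, angle_deg):
    """Rotate 3D joints (x, y, z) about the vertical y axis."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in joints3d]

def project_to_2d(joints3d):
    """Assumed orthographic projection: drop the depth (z) coordinate."""
    return [(x, y) for x, y, _ in joints3d]

def normalize(joints2d, root=0):
    """Center on the root joint and scale so the skeleton height equals 1."""
    rx, ry = joints2d[root]
    ys = [y for _, y in joints2d]
    height = (max(ys) - min(ys)) or 1.0  # guard against a degenerate pose
    return [((x - rx) / height, (y - ry) / height) for x, y in joints2d]

def quantize(joints2d, levels=256):
    """Map coordinates from [-1, 1] onto integer levels, clamping outliers."""
    def q(v):
        v = max(-1.0, min(1.0, v))
        return round((v + 1.0) / 2.0 * (levels - 1))
    return [(q(x), q(y)) for x, y in joints2d]

# Hypothetical toy skeleton: pelvis (root), head, left hand, right hand, feet.
skeleton = [
    (0.0, 1.0, 0.0),
    (0.0, 1.8, 0.0),
    (-0.8, 1.4, 0.0),
    (0.8, 1.4, 0.0),
    (0.0, 0.0, 0.0),
]

def arm_span_2d(joints3d):
    """Horizontal hand-to-hand distance after projection and normalization."""
    pts = normalize(project_to_2d(joints3d))
    return abs(pts[3][0] - pts[2][0])

frontal = arm_span_2d(skeleton)               # subject faces the camera
turned = arm_span_2d(rotate_y(skeleton, 60))  # subject turned by 60 degrees

print(f"frontal span: {frontal:.3f}, turned span: {turned:.3f}")
print(quantize(normalize(project_to_2d(skeleton)), levels=16))
```

In this toy setting the same pose yields an arm span half as large once the subject turns by 60 degrees, because the projection keeps only the cosine component of the horizontal extent. This is one simple instance of the orientation and size distortions the abstract refers to.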

Notes

  1.

    The same trends and experimental outcomes were also observed with the uni-directional architecture, which is preferable in online scenarios where the future evolution of skeleton poses is not yet known and therefore cannot be used for the backward pass.

Acknowledgements

This research is supported by the Czech Science Foundation project No. GA19-02033S.

Author information

Corresponding author

Correspondence to Petr Elias.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by F. Wu.

About this article

Cite this article

Elias, P., Sedmidubsky, J. & Zezula, P. Understanding the limits of 2D skeletons for action recognition. Multimedia Systems 27, 547–561 (2021). https://doi.org/10.1007/s00530-021-00754-0