
Predictively encoded graph convolutional network for noise-robust skeleton-based action recognition

Published in Applied Intelligence

Abstract

In skeleton-based action recognition, graph convolutional networks (GCNs), which model human body skeletons using graphical components such as nodes and connections, have recently achieved remarkable performance. While current state-of-the-art methods for skeleton-based action recognition usually assume that completely observed skeletons will be provided, this assumption is difficult to satisfy in real-world scenarios, since captured skeletons may be incomplete or noisy. In this work, we propose a skeleton-based action recognition method that is robust to noise interference in the given skeleton features. The key insight of our approach is to train a model by maximizing the mutual information between normal and noisy skeletons using predictive coding in the latent space. We conducted comprehensive skeleton-based action recognition experiments with defective skeletons using the NTU-RGB+D and Kinetics-Skeleton datasets. The experimental results demonstrate that when the skeleton samples are noisy, our approach achieves outstanding performance compared with existing state-of-the-art methods.
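
To illustrate the core idea, the following is a minimal sketch, assuming a PyTorch-style skeleton GCN encoder, of how the mutual information between embeddings of clean and noisy skeleton sequences can be maximized with an InfoNCE-style contrastive predictive objective. The encoder, tensor shapes, and temperature are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' implementation): an InfoNCE-style loss
# that encourages a noisy skeleton embedding to be predictive of the
# embedding of its clean counterpart, with other clean samples in the
# batch serving as negatives.

import torch
import torch.nn.functional as F

def info_nce_loss(z_clean: torch.Tensor, z_noisy: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """z_clean, z_noisy: (batch, dim) embeddings from a shared GCN encoder (hypothetical)."""
    z_clean = F.normalize(z_clean, dim=1)
    z_noisy = F.normalize(z_noisy, dim=1)
    # Similarity of every noisy embedding against every clean embedding.
    logits = z_noisy @ z_clean.t() / temperature          # (batch, batch)
    # The matching clean sample is the positive; the diagonal gives the targets,
    # so the objective reduces to a cross-entropy over in-batch candidates.
    targets = torch.arange(z_clean.size(0), device=z_clean.device)
    return F.cross_entropy(logits, targets)

# Usage sketch (gcn_encoder, clean_batch, and noisy_batch are hypothetical):
# loss = info_nce_loss(gcn_encoder(clean_batch), gcn_encoder(noisy_batch))
```

Minimizing this loss lower-bounds the mutual information between the two views, pulling each noisy embedding toward the latent representation of its own clean sequence.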



Acknowledgments

This work was partly supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2014-0-00077, Development of global multitarget tracking and event prediction techniques based on real-time large-scale video analysis) and a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2019R1A2C208748911).

Author information

Corresponding author

Correspondence to Moongu Jeon.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Yoon, Y., Yu, J. & Jeon, M. Predictively encoded graph convolutional network for noise-robust skeleton-based action recognition. Appl Intell 52, 2317–2331 (2022). https://doi.org/10.1007/s10489-021-02487-z
