
Viewpoint guided multi-stream neural network for skeleton action recognition

Published in: Multimedia Tools and Applications

Abstract

Skeleton-based human action recognition has attracted considerable attention and achieved notable success in computer vision. However, one of the main challenges in skeleton action recognition is handling complex viewpoint variations. Moreover, existing methods tend to rely on complicated networks with large model sizes. To this end, we introduce a novel viewpoint-guided feature that adaptively selects the optimal observation point to address the viewpoint variation problem. Furthermore, we present a novel multi-stream neural network for skeleton action recognition, namely the Viewpoint Guided Multi-stream Neural Network (VGMNet). In particular, by incorporating four streams of spatial and temporal information, the proposed VGMNet can effectively learn discriminative features from skeleton sequences. We validate our method on three widely used datasets, i.e., SHREC, NTU RGB+D, and Florence 3D. On SHREC, the proposed method achieves better accuracy and efficiency than state-of-the-art approaches. Furthermore, the highest scores on Florence 3D and NTU RGB+D show that our method is suitable for real-world application scenarios with edge computing and is compatible with multi-person action recognition.
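To make the two ideas above concrete, the sketch below shows how a viewpoint-normalised skeleton sequence can feed four parallel spatial/temporal streams whose class scores are fused. It is a minimal PyTorch illustration under stated assumptions, not the published VGMNet: the SVD-based canonical-frame alignment stands in for the paper's adaptive observation-point selection, and the four stream definitions (joint positions, joint motion, bone vectors, bone motion) are hypothetical choices for the spatial and temporal streams; all names (normalize_viewpoint, StreamNet, MultiStreamClassifier) are ours.

```python
import torch
import torch.nn as nn


def normalize_viewpoint(seq):
    """Rotate one skeleton sequence (T, J, 3) into a data-driven canonical frame.

    Illustrative stand-in for viewpoint guidance: centre each frame and align
    the principal axes of the first pose with the coordinate axes via SVD.
    """
    seq = seq - seq.mean(dim=1, keepdim=True)             # centre every frame
    _, _, vh = torch.linalg.svd(seq[0], full_matrices=False)
    return seq @ vh.transpose(0, 1)                        # rotate all frames


class StreamNet(nn.Module):
    """One stream: a small temporal convolution over flattened joint features."""

    def __init__(self, in_dim, num_classes, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pool over time, so streams may differ in length
            nn.Flatten(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):              # x: (B, T, D)
        return self.net(x.transpose(1, 2))


class MultiStreamClassifier(nn.Module):
    """Four hypothetical streams: joints, joint motion, bones, bone motion."""

    def __init__(self, num_joints, num_classes, bones):
        super().__init__()
        self.bones = bones                                  # list of (child, parent)
        dims = [num_joints * 3, num_joints * 3, len(bones) * 3, len(bones) * 3]
        self.streams = nn.ModuleList(StreamNet(d, num_classes) for d in dims)

    def forward(self, seq):                                 # seq: (B, T, J, 3)
        b, t, _, _ = seq.shape
        joints = seq.reshape(b, t, -1)                      # spatial stream
        motion = joints[:, 1:] - joints[:, :-1]             # temporal stream
        child = seq[:, :, [c for c, _ in self.bones]]
        parent = seq[:, :, [p for _, p in self.bones]]
        bones = (child - parent).reshape(b, t, -1)          # second spatial stream
        bone_motion = bones[:, 1:] - bones[:, :-1]          # second temporal stream
        scores = [s(x) for s, x in zip(self.streams,
                                       (joints, motion, bones, bone_motion))]
        return torch.stack(scores).mean(dim=0)              # late score fusion


# Example: NTU-style 25-joint skeletons, 30 frames, 60 action classes (toy data).
bones = [(i, i - 1) for i in range(1, 25)]                  # toy chain skeleton
model = MultiStreamClassifier(num_joints=25, num_classes=60, bones=bones)
clips = torch.stack([normalize_viewpoint(torch.randn(30, 25, 3)) for _ in range(2)])
logits = model(clips)                                       # shape (2, 60)
```

Late score fusion is used here only for brevity; the point of the sketch is that each stream sees a differently derived view of the same viewpoint-normalised sequence, so the fused classifier combines complementary spatial and temporal cues.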


Data availability

The datasets analysed during the current study are available in the following links.

• SHREC: http://www-rech.telecom-lille.fr/shrec2017-hand

• NTU RGB+D: https://rose1.ntu.edu.sg/dataset/actionRecognition

• Florence 3D: https://www.micc.unifi.it/resources/datasets/florence-3d-actions-dataset


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 61876042, and by the Guangdong Basic and Applied Basic Research Foundation (No. 2020A1515011493).

Author information


Corresponding author

Correspondence to Ming Yin.

Ethics declarations

Conflicts of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

He, Y., Liang, Z., He, S. et al. Viewpoint guided multi-stream neural network for skeleton action recognition. Multimed Tools Appl 83, 6783–6802 (2024). https://doi.org/10.1007/s11042-023-15676-4
