Abstract
3D action recognition has attracted much attention in the machine learning field in recent years, and recurrent neural networks (RNNs) have been widely used for it because of their efficiency in processing sequential data. However, to achieve good performance, traditional RNN architectures are usually time-consuming to train and slow at inference. To address this problem, this paper proposes a global context-aware attention spatio-temporal SRU (GCA-ST-SRU) method for 3D action recognition, which extends the original simple recurrent unit (SRU) to the joint spatio-temporal domain with an attention mechanism. First, deep neural networks are employed to learn features of the skeleton joints at each frame; these high-level feature sequences are then classified with the GCA-ST-SRU, which learns the spatio-temporal dependencies between joints in the same frame and pays more attention to informative joints. Extensive experiments were conducted on the UT-Kinect and SBU-Kinect Interaction datasets to evaluate the effectiveness of the proposed method. Compared with several existing algorithms, including SRU, long short-term memory (LSTM), spatio-temporal LSTM (ST-LSTM) and global context-aware attention LSTM (GCA-LSTM), our method exhibits better classification accuracy and computational efficiency. The experimental results demonstrate the effectiveness and practicability of our algorithm: compared to methods with similar accuracy, it reduces training time and improves inference speed, thus achieving a balance between speed and accuracy.
References
Ahmed F, Paul PP, Gavrilova ML (2016) Joint-triplet motion image and local binary pattern for 3d action recognition using Kinect. In: Proceedings of the 29th International Conference on Computer Animation and Social Agents. ACM, pp 111–119
Baradel F, Wolf C, Mille J (2017) Human action recognition: pose-based attention draws focus to hands. In: Proceedings of the IEEE International Conference on Computer Vision, pp 604–613
Chen C, Liu K, Kehtarnavaz N (2016) Real-time human action recognition based on depth motion maps. J Real-Time Image Proc 12(1):155–163
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
De Mulder W, Bethard S, Moens MF (2015) A survey on the application of recurrent neural networks to statistical language modeling. Comput Speech Lang 30(1):61–98
Di Gangi MA, Federico M (2018) Deep neural machine translation with weakly-recurrent unit. arXiv preprint arXiv:1805.04185
Gal Y, Ghahramani Z (2016) A theoretically grounded application of dropout in recurrent neural networks. In: Advances in Neural Information Processing Systems, pp 1019–1027
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Huang H, Wang H, Mak B (2019) Recurrent Poisson process unit for speech recognition. In: AAAI, pp 6538–6545
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Kong Y, Fu Y (2018) Human action recognition and prediction: a survey. arXiv preprint arXiv:1806.11230
Koniusz P, Cherian A, Porikli F (2016) Tensor representations via kernel linearization for action recognition from 3d skeletons. In: European conference on computer vision. Springer, Cham, pp 37–53
Lei T, Zhang Y, Wang SI, Dai H, Artzi Y (2018) Simple recurrent units for highly parallelizable recurrence. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp 4470–4481
Liu J, Shahroudy A, Xu D, Wang G (2016) Spatio-temporal LSTM with trust gates for 3d human action recognition. In: European Conference on Computer Vision. Springer, Cham, pp 816–833
Liu J, Wang G, Duan LY, Abdiyeva K, Kot AC (2017) Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Trans Image Process 27(4):1586–1599
Mabrouk AB, Zagrouba E (2018) Abnormal behavior recognition for intelligent video surveillance systems: a review. Expert Syst Appl 91:480–491
Park J, Boo Y, Choi I, Shin S, Sung W (2018) Fully neural network based speech recognition on mobile and embedded devices. In: Advances in Neural Information Processing Systems, pp 10620–10630
Park C, Lee C, Hong L, Hwang Y, Yoo T, Jang J, … Kim HK (2019) S2-Net: machine reading comprehension with SRU-based self-matching networks. ETRI J 41(3):371–382
Presti LL, La Cascia M (2016) 3D skeleton-based human action classification: a survey. Pattern Recogn 53(5):130–147
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp 568–576
Slama R, Wannous H, Daoudi M et al (2015) Accurate 3D action recognition using learning on the Grassmann manifold. Pattern Recogn 48(2):556–567
Tamamori A, Hayashi T, Toda T, Takeda K (2018) Daily activity recognition based on recurrent neural network using multi-modal signals. APSIPA Transactions on Signal and Information Processing 7:E21. https://doi.org/10.1017/ATSIP.2018.25
Tang Y, Tian Y, Lu J, Li P, Zhou J (2018) Deep progressive reinforcement learning for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5323–5332
Vemulapalli R, Arrate F, Chellappa R (2014) Human action recognition by representing 3d skeletons as points in a Lie group. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 588–595
Wang D, Nyberg E (2015) A long short-term memory model for answer sentence selection in question answering. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing 2(4): 707–712
Xi R, Li M, Hou M, Fu M, Qu H, Liu D, Haruna CR (2018) Deep dilation on multimodality time series for human activity recognition. IEEE Access 6(1):53381–53396
Xia L, Chen CC, Aggarwal JK (2012) View invariant human action recognition using histograms of 3d joints. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp 20–27
Yang Z, Bu L, Wang T, Ouyang J, Yuan P (2018) Fire alarm for video surveillance based on convolutional neural network and SRU. In: 2018 5th International Conference on Information Science and Control Engineering (ICISCE), pp 232–236
Yun K, Honorio J, Chattopadhyay D, Berg TL, Samaras D (2012) Two-person interaction detection using body-pose features and multiple instance learning. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp 28–35
Zhang S, Liu X, Xiao J (2017) On geometric features for skeleton-based action recognition using multilayer LSTM networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp 148–157
Zhang P, Xue J, Lan C, Zeng W, Gao Z, Zheng N (2018) Adding attentiveness to the neurons in recurrent neural networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 135–151
Zheng Z, An G, Ruan Q (2017) Multi-level recurrent residual networks for action recognition. arXiv preprint arXiv:1711.08238.
Zhu Y, Chen W, Guo G (2013) Fusing spatiotemporal features and joints for 3d action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 486–491
Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, Xie X (2016) Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: Thirtieth AAAI Conference on Artificial Intelligence
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant No. 61871427. The authors would like to acknowledge the UT-Kinect Dataset and SBU-Kinect Interaction Dataset, which were used to test the algorithms proposed in this study.
Appendices
Appendix 1. Long short-term memory network (LSTM)
The LSTM network is a popular model for processing sequence data and performs well at modeling long-term temporal dependencies. The computation of the LSTM network is unrolled into one step per element of the sequence, and the computation at the tth step can be formulated as below:

\( \left(i_t,\ f_t,\ o_t,\ u_t\right) = \left(\sigma,\ \sigma,\ \sigma,\ \tanh\right)\left(M\begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}\right) \) (18)

\( c_t = i_t \odot u_t + f_t \odot c_{t-1} \) (19)

\( h_t = o_t \odot \tanh(c_t) \) (20)

where \( M \) is a model parameter matrix, \( x_t \) is the input at the tth step, \( h_{t-1} \) is the output state at the (t − 1)th step, and \( \sigma \) denotes the sigmoid activation function. The detailed process is as follows. The input gate \( i_t \), forget gate \( f_t \), output gate \( o_t \) and input information \( u_t \) are first calculated according to Eq. (18); the elements of the three gates lie between 0 and 1, while those of \( u_t \) lie between −1 and 1 due to the tanh activation. Then the internal state \( c_t \) is updated according to Eq. (19), where \( \odot \) denotes the element-wise product: \( i_t \) weights the new information \( u_t \), and \( f_t \) weights the previous internal state \( c_{t-1} \). Finally, the output state \( h_t \) is obtained by Eq. (20).
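To make the gate structure concrete, one LSTM step can be sketched in NumPy. This is an illustrative toy implementation, not the authors' code; the dimensions, random parameters and function names are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, M):
    """One LSTM step following Eq. (18)-(20): all four blocks come
    from a single parameter matrix M applied to [x_t; h_prev]."""
    d = h_prev.shape[0]
    z = M @ np.concatenate([x_t, h_prev])   # shape (4d,)
    i_t = sigmoid(z[0:d])                   # input gate
    f_t = sigmoid(z[d:2*d])                 # forget gate
    o_t = sigmoid(z[2*d:3*d])               # output gate
    u_t = np.tanh(z[3*d:4*d])               # input information
    c_t = i_t * u_t + f_t * c_prev          # Eq. (19): internal state
    h_t = o_t * np.tanh(c_t)                # Eq. (20): output state
    return h_t, c_t

# Toy usage: input dimension 3, hidden dimension 2.
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 5))             # 4d rows, (input + hidden) cols
h, c = np.zeros(2), np.zeros(2)
for x in rng.standard_normal((4, 3)):       # a 4-step sequence
    h, c = lstm_step(x, h, c, M)
print(h.shape)
```

Note that the computation of step t cannot start before step t − 1 has produced \( h_{t-1} \) and \( c_{t-1} \), which is the sequential bottleneck the SRU removes.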
Appendix 2. Spatio-temporal long short-term memory network (ST-LSTM)
The ST-LSTM is a spatio-temporal extension of LSTM, which can capture the dependencies between joints in the temporal and spatial domains simultaneously. The ST-LSTM transition equations are formulated as below:

\( \left(i_{j,t},\ f_{j,t}^S,\ f_{j,t}^T,\ o_{j,t},\ u_{j,t}\right) = \left(\sigma,\ \sigma,\ \sigma,\ \sigma,\ \tanh\right)\left(M\begin{pmatrix} x_{j,t} \\ h_{j-1,t} \\ h_{j,t-1} \end{pmatrix}\right) \) (21)

\( c_{j,t} = i_{j,t} \odot u_{j,t} + f_{j,t}^S \odot c_{j-1,t} + f_{j,t}^T \odot c_{j,t-1} \) (22)

\( h_{j,t} = o_{j,t} \odot \tanh(c_{j,t}) \) (23)

where \( x_{j,t} \) is the input at the spatio-temporal step (j, t), corresponding to the jth joint at the tth frame. Eq. (21) extends \( f_t \) in Eq. (18) to \( {f}_{j,t}^S \) and \( {f}_{j,t}^T \), the forget gates in the spatial and temporal domains, respectively. Similarly, in Eq. (22) the internal state \( c_{j,t} \) depends on both \( c_{j-1,t} \) at the previous spatial step (joint j − 1) and \( c_{j,t-1} \) at the previous temporal step (frame t − 1).
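A single ST-LSTM step can be sketched the same way; the only change from a plain LSTM step is that two predecessor states are consumed, one from the previous joint and one from the previous frame, and two forget gates are produced. Again, this is a minimal illustrative sketch with assumed dimensions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def st_lstm_step(x, h_space, c_space, h_time, c_time, M):
    """One ST-LSTM step at spatio-temporal position (j, t).
    h_space/c_space come from joint j-1 at frame t; h_time/c_time
    come from joint j at frame t-1. Five blocks are produced:
    i, f_S (spatial forget), f_T (temporal forget), o, u."""
    d = h_space.shape[0]
    z = M @ np.concatenate([x, h_space, h_time])   # shape (5d,)
    i  = sigmoid(z[0:d])
    fS = sigmoid(z[d:2*d])          # spatial forget gate
    fT = sigmoid(z[2*d:3*d])        # temporal forget gate
    o  = sigmoid(z[3*d:4*d])
    u  = np.tanh(z[4*d:5*d])
    c = i * u + fS * c_space + fT * c_time         # Eq. (22)
    h = o * np.tanh(c)
    return h, c

# Toy usage: input dimension 3, hidden dimension 2.
rng = np.random.default_rng(1)
M = rng.standard_normal((10, 7))   # 5d rows, (input + 2*hidden) cols
zeros = np.zeros(2)
h, c = st_lstm_step(rng.standard_normal(3), zeros, zeros, zeros, zeros, M)
print(h.shape)
```

Iterating this step over all joints within a frame and over all frames gives the full spatio-temporal unrolling.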
Appendix 3. Simple recurrent unit (SRU)
The SRU is a recurrent network like LSTM and GRU, but the majority of the computation at each step is independent of the recurrence and can easily be parallelized [13]. The transition equations are formulated as below:

\( \tilde{x}_t = W x_t \) (24)

\( f_t = \sigma\left(W_f x_t + b_f\right) \) (25)

\( r_t = \sigma\left(W_r x_t + b_r\right) \) (26)

\( c_t = f_t \odot c_{t-1} + \left(1 - f_t\right) \odot \tilde{x}_t \) (27)

\( h_t = r_t \odot \tanh(c_t) + \left(1 - r_t\right) \odot x_t \) (28)

where \( x_t \) denotes the input at time t, \( \tilde{x}_t \) is a linear transformation of \( x_t \), \( f_t \) is the forget gate, \( r_t \) is the reset gate, \( c_t \) denotes the internal state and \( h_t \) denotes the output state. Unlike LSTM, the computations of \( \tilde{x}_t \), \( f_t \) and \( r_t \) in SRU do not depend on the internal state \( c_{t-1} \) from the previous time step, so the inference speed can be improved by parallelizing these computations across time steps.
The difference between the computation processes of LSTM and SRU is illustrated in Fig. 6. For LSTM, the symbols in each rectangle represent the variables that must be computed at each time step t = 1, 2, ..., n. The computation of \( h_t \) cannot begin until the previous time step has finished, and this sequential dependency limits the inference speed of the LSTM. For SRU, the computations of \( \tilde{x}_t \), \( f_t \) and \( r_t \) depend only on the input \( x_t \) at each time step, so they are independent of the recurrence, as shown in the dotted-line box. The remaining recurrent computations of \( c_t \) and \( h_t \) involve only element-wise products, which are relatively lightweight and therefore fast. Parallelizing the majority of the per-step computation significantly improves the inference speed of SRU.
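The split between the parallelizable part and the lightweight sequential part can be sketched as follows. In this toy NumPy version (our own illustration, with assumed shapes and parameter names), all three matrix multiplications are performed for the whole sequence at once; only the element-wise update of the internal state remains a loop over time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru(X, W, Wf, bf, Wr, br):
    """SRU over a sequence X of shape (T, d). The matrix products
    depend only on the inputs, so they are computed for all time
    steps in one batch; the recurrence is element-wise only."""
    # Parallel part: no dependence on previous time steps.
    X_tilde = X @ W.T             # linear transform of every input
    F = sigmoid(X @ Wf.T + bf)    # forget gates for all steps
    R = sigmoid(X @ Wr.T + br)    # reset gates for all steps
    # Lightweight sequential part: element-wise products only.
    c = np.zeros(X.shape[1])
    H = np.empty_like(X)
    for t in range(X.shape[0]):
        c = F[t] * c + (1.0 - F[t]) * X_tilde[t]
        H[t] = R[t] * np.tanh(c) + (1.0 - R[t]) * X[t]
    return H

# Toy usage: a 5-step sequence with dimension 4.
rng = np.random.default_rng(2)
d = 4
X = rng.standard_normal((5, d))
W, Wf, Wr = (rng.standard_normal((d, d)) for _ in range(3))
bf = br = np.zeros(d)
H = sru(X, W, Wf, bf, Wr, br)
print(H.shape)
```

Because the three matrix multiplications dominate the cost and are batched across time, the remaining loop touches only vectors, which is what makes SRU inference fast in practice.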
Cite this article
She, Q., Mu, G., Gan, H. et al. Spatio-temporal SRU with global context-aware attention for 3D human action recognition. Multimed Tools Appl 79, 12349–12371 (2020). https://doi.org/10.1007/s11042-019-08587-w