## Abstract

3D action recognition has attracted much attention in the machine learning community in recent years, and recurrent neural networks (RNNs) are widely used for it because of their efficiency in processing sequential data. However, traditional RNN architectures tuned for good performance are usually time-consuming to train and slow at inference. To address this problem, this paper proposes a global context-aware attention spatio-temporal SRU (GCA-ST-SRU) for 3D action recognition, extending the original simple recurrent unit (SRU) to the joint spatio-temporal domain with an attention mechanism. First, deep neural networks learn features of the skeleton joints at each frame; the resulting high-level feature sequences are then classified with the GCA-ST-SRU, which learns the spatio-temporal dependence between different joints in the same frame and pays more attention to informative joints. Extensive experiments on the UT-Kinect and SBU-Kinect Interaction datasets evaluate the effectiveness of the proposed method. Compared with several existing algorithms, including the SRU, long short-term memory (LSTM), spatio-temporal LSTM (ST-LSTM) and global context-aware attention LSTM (GCA-LSTM), our method exhibits better classification accuracy and computational efficiency. The experimental results demonstrate the effectiveness and practicability of the algorithm: compared to methods with similar accuracy, it reduces training time and improves inference speed, striking a balance between speed and accuracy.


## References

- 1.
Ahmed F, Paul PP, Gavrilova ML (2016) Joint-triplet motion image and local binary pattern for 3D action recognition using Kinect. In: Proceedings of the 29th International Conference on Computer Animation and Social Agents. ACM, pp 111–119

- 2.
Baradel F, Wolf C, Mille J (2017) Human action recognition: pose-based attention draws focus to hands. In: Proceedings of the IEEE International Conference on Computer Vision, pp 604–613

- 3.
Chen C, Liu K, Kehtarnavaz N (2016) Real-time human action recognition based on depth motion maps. J Real-Time Image Proc 12(1):155–163

- 4.
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078

- 5.
De Mulder W, Bethard S, Moens MF (2015) A survey on the application of recurrent neural networks to statistical language modeling. Comput Speech Lang 30(1):61–98

- 6.
Di Gangi MA, Federico M (2018) Deep neural machine translation with weakly-recurrent unit. arXiv preprint arXiv:1805.04185

- 7.
Gal Y, Ghahramani Z (2016) A theoretically grounded application of dropout in recurrent neural networks. In: Advances in Neural Information Processing Systems, pp 1019–1027

- 8.
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

- 9.
Huang H, Wang H, Mak B (2019) Recurrent Poisson process unit for speech recognition. In: AAAI, pp 6538–6545

- 10.
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980

- 11.
Kong Y, Fu Y (2018) Human action recognition and prediction: a survey. arXiv preprint arXiv:1806.11230

- 12.
Koniusz P, Cherian A, Porikli F (2016) Tensor representations via kernel linearization for action recognition from 3D skeletons. In: European Conference on Computer Vision. Springer, Cham, pp 37–53

- 13.
Lei T, Zhang Y, Wang SI, Dai H, Artzi Y (2018) Simple recurrent units for highly parallelizable recurrence. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp 4470–4481

- 14.
Liu J, Shahroudy A, Xu D, Wang G (2016) Spatio-temporal LSTM with trust gates for 3D human action recognition. In: European Conference on Computer Vision. Springer, Cham, pp 816–833

- 15.
Liu J, Wang G, Duan LY, Abdiyeva K, Kot AC (2017) Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Trans Image Process 27(4):1586–1599

- 16.
Mabrouk AB, Zagrouba E (2018) Abnormal behavior recognition for intelligent video surveillance systems: a review. Expert Syst Appl 91(1):480–491

- 17.
Park J, Boo Y, Choi I, Shin S, Sung W (2018) Fully neural network based speech recognition on mobile and embedded devices. In: Advances in Neural Information Processing Systems, pp 10620–10630

- 18.
Park C, Lee C, Hong L, Hwang Y, Yoo T, Jang J, … Kim HK (2019) S2-Net: machine reading comprehension with SRU-based self-matching networks. ETRI J 41(3):371–382

- 19.
Presti LL, La Cascia M (2016) 3D skeleton-based human action classification: a survey. Pattern Recogn 53(5):130–147

- 20.
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp 568–576

- 21.
Slama R, Wannous H, Daoudi M et al (2015) Accurate 3D action recognition using learning on the Grassmann manifold. Pattern Recogn 48(2):556–567

- 22.
Tamamori A, Hayashi T, Toda T, Takeda K (2018) Daily activity recognition based on recurrent neural network using multi-modal signals. APSIPA Transactions on Signal and Information Processing 7:E21. https://doi.org/10.1017/ATSIP.2018.25

- 23.
Tang Y, Tian Y, Lu J, Li P, Zhou J (2018) Deep progressive reinforcement learning for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5323–5332

- 24.
Vemulapalli R, Arrate F, Chellappa R (2014) Human action recognition by representing 3D skeletons as points in a Lie group. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 588–595

- 25.
Wang D, Nyberg E (2015) A long short-term memory model for answer sentence selection in question answering. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing 2(4):707–712

- 26.
Xi R, Li M, Hou M, Fu M, Qu H, Liu D, Haruna CR (2018) Deep dilation on multimodality time series for human activity recognition. IEEE Access 6(1):53381–53396

- 27.
Xia L, Chen CC, Aggarwal JK (2012) View invariant human action recognition using histograms of 3D joints. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp 20–27

- 28.
Yang Z, Bu L, Wang T, Ouyang J, Yuan P (2018) Fire alarm for video surveillance based on convolutional neural network and SRU. In: 2018 5th International Conference on Information Science and Control Engineering (ICISCE), pp 232–236

- 29.
Yun K, Honorio J, Chattopadhyay D, Berg TL, Samaras D (2012) Two-person interaction detection using body-pose features and multiple instance learning. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp 28–35

- 30.
Zhang S, Liu X, Xiao J (2017) On geometric features for skeleton-based action recognition using multilayer LSTM networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp 148–157

- 31.
Zhang P, Xue J, Lan C, Zeng W, Gao Z, Zheng N (2018) Adding attentiveness to the neurons in recurrent neural networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 135–151

- 32.
Zheng Z, An G, Ruan Q (2017) Multi-level recurrent residual networks for action recognition. arXiv preprint arXiv:1711.08238

- 33.
Zhu Y, Chen W, Guo G (2013) Fusing spatiotemporal features and joints for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 486–491

- 34.
Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, Xie X (2016) Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: Thirtieth AAAI Conference on Artificial Intelligence

## Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 61871427. The authors would like to acknowledge the UT-Kinect and SBU-Kinect Interaction datasets, which were used to test the algorithms proposed in this study.

## Additional information

### Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendices

### Appendix 1. Long short-term memory network (LSTM)

The LSTM network is a popular model for processing sequential data, with good performance in modeling long-term temporal dependencies. The computation of the LSTM network unrolls into one step per element of the sequence, and the computation at each step can be formulated as below:
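These are the standard LSTM transition equations; they are reconstructed here to be consistent with the description that follows, with equation numbers matching the text's references:

$$
\begin{pmatrix} i_t \\ f_t \\ o_t \\ u_t \end{pmatrix}
= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
\left( M \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix} \right) \tag{18}
$$

$$
c_t = i_t \odot u_t + f_t \odot c_{t-1} \tag{19}
$$

$$
h_t = o_t \odot \tanh(c_t) \tag{20}
$$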

where *M* is a model parameter matrix, *x*_{t} is the input at the *t*^{th} step, *h*_{t − 1} is the output state at the (*t* − 1)^{th} step, and *σ* denotes the sigmoid activation function. In detail: the input gate *i*_{t}, forget gate *f*_{t}, output gate *o*_{t} and input information *u*_{t} are first calculated according to Eq. (18), where the elements of the three gate matrices range from 0 to 1. Then the internal state *c*_{t} is updated according to Eq. (19), where ⊙ denotes the element-wise product, *i*_{t} determines the weight of *u*_{t}, and *f*_{t} determines the weight of the previous internal state *c*_{t − 1}. Finally, the output state *h*_{t} is obtained by Eq. (20).

### Appendix 2. Spatio-temporal long short-term memory network (ST-LSTM)

The ST-LSTM is a spatial extension of the LSTM that captures the dependence of joints in the temporal and spatial domains simultaneously. The ST-LSTM transition equations are formulated as below:
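Consistent with the description that follows, the ST-LSTM updates of [14] can be written as below, with equation numbers matching the text's references:

$$
\begin{pmatrix} i_{j,t} \\ f_{j,t}^{S} \\ f_{j,t}^{T} \\ o_{j,t} \\ u_{j,t} \end{pmatrix}
= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
\left( M \begin{pmatrix} x_{j,t} \\ h_{j-1,t} \\ h_{j,t-1} \end{pmatrix} \right) \tag{21}
$$

$$
c_{j,t} = i_{j,t} \odot u_{j,t} + f_{j,t}^{S} \odot c_{j-1,t} + f_{j,t}^{T} \odot c_{j,t-1} \tag{22}
$$

$$
h_{j,t} = o_{j,t} \odot \tanh(c_{j,t}) \tag{23}
$$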

where *x*_{j, t} is the input at the spatio-temporal step (*j*, *t*), which corresponds to the *j*^{th} joint in the *t*^{th} frame. Eq. (21) extends *f*_{t} in Eq. (18) to \( {f}_{j,t}^S \) and \( {f}_{j,t}^T \), the forget gates in the spatial and temporal domains, respectively. Similarly, in Eq. (22) the computation of the internal state *c*_{j, t} depends on *c*_{j − 1, t} at the previous spatial step and on *c*_{j, t − 1} at the previous temporal step.

### Appendix 3. Simple recurrent unit (SRU)

The SRU is a recurrent network like the LSTM and GRU, but the majority of the computation at each step is independent of the recurrence and can be easily parallelized [13]. The transition equations are formulated as below:
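One common form of the SRU transition equations from [13] (the basic variant, without the optional highway-scaling terms) is:

$$
\begin{aligned}
\tilde{x}_t &= W x_t\\
f_t &= \sigma(W_f x_t + b_f)\\
r_t &= \sigma(W_r x_t + b_r)\\
c_t &= f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{x}_t\\
h_t &= r_t \odot \tanh(c_t) + (1 - r_t) \odot x_t
\end{aligned}
$$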

where *x*_{t} denotes the input at time *t*, \( {\tilde{x}}_t \) is a linear transformation of *x*_{t}, *f*_{t} is the forget gate, *r*_{t} is the reset gate, *c*_{t} denotes the internal state, and *h*_{t} denotes the output state. Unlike the LSTM, the computations of \( {\tilde{x}}_t \), *f*_{t} and *r*_{t} in the SRU do not depend on the internal state *c*_{t − 1} at the previous time step, so inference speed is improved by parallelizing these computations.

The difference between the computation processes of the LSTM and the SRU is shown in Fig. 6. For the LSTM, the symbols in each rectangle represent the variables computed at each time step *t* = 1, 2, ..., *n*: the computation of *h*_{t} cannot start until the previous time step has completed, and these sequential dependencies limit the inference speed of the LSTM. For the SRU, the computations of \( {\tilde{x}}_t \), *f*_{t} and *r*_{t} depend only on the input *x*_{t} at each time step, so they are independent of the recurrence, as shown in the dotted-line box. The remaining recurrent computations of *c*_{t} and *h*_{t} involve only lightweight element-wise products and are therefore fast. Parallelizing the majority of the per-step computation significantly improves the inference speed of the SRU.
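This split between batched projections and a lightweight recurrence can be sketched in a minimal NumPy forward pass. This is an illustrative toy, not the authors' implementation; all names, shapes, and the random initialization are invented for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward(x, W, Wf, bf, Wr, br):
    """Minimal SRU forward pass over a (T, d) input sequence.

    The three projections below depend only on the inputs x, so they can
    be computed for all T time steps at once -- the parallelizable part.
    """
    x_tilde = x @ W                   # candidate values, all steps at once
    f = sigmoid(x @ Wf + bf)          # forget gates, all steps at once
    r = sigmoid(x @ Wr + br)          # reset gates, all steps at once

    # Only this loop is sequential, and it uses element-wise ops only.
    T, d = x.shape
    c = np.zeros(d)
    h = np.empty((T, d))
    for t in range(T):
        c = f[t] * c + (1.0 - f[t]) * x_tilde[t]        # internal state
        h[t] = r[t] * np.tanh(c) + (1.0 - r[t]) * x[t]  # output state
    return h

# Toy usage with random weights on a short sequence.
rng = np.random.default_rng(0)
T, d = 5, 4
x = rng.normal(size=(T, d))
W, Wf, Wr = (rng.normal(scale=0.5, size=(d, d)) for _ in range(3))
bf = np.zeros(d)
br = np.zeros(d)
h = sru_forward(x, W, Wf, bf, Wr, br)
print(h.shape)  # (5, 4)
```

The point of the sketch is that the matrix multiplications, which dominate the cost, sit outside the time loop; only cheap element-wise updates remain inside it.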

## About this article

### Cite this article

She, Q., Mu, G., Gan, H. *et al.* Spatio-temporal SRU with global context-aware attention for 3D human action recognition.
*Multimed Tools Appl* (2020) doi:10.1007/s11042-019-08587-w


### Keywords

- 3D action recognition
- Recurrent neural networks
- Simple recurrent unit
- Spatio-temporal analysis
- Attention mechanism