Spatio-temporal SRU with global context-aware attention for 3D human action recognition



3D action recognition has attracted much attention in machine learning in recent years, and recurrent neural networks (RNNs) have been widely used for it because they are well suited to sequential data. However, traditional RNN architectures that achieve good performance are usually slow to train and slow at inference. To address this problem, this paper proposes a global context-aware attention spatio-temporal SRU (GCA-ST-SRU) method for 3D action recognition, which extends the original simple recurrent unit (SRU) to the joint spatio-temporal domain and adds an attention mechanism. First, deep neural networks learn features of the skeleton joints in each frame; these high-level feature sequences are then classified by the GCA-ST-SRU method, which can learn the spatio-temporal dependence between different joints in the same frame and pay more attention to informative joints. Extensive experiments were conducted on the UT-Kinect and SBU-Kinect Interaction datasets to evaluate the proposed method. Compared with several existing algorithms, including SRU, long short-term memory (LSTM), spatio-temporal LSTM (ST-LSTM) and global context-aware attention LSTM (GCA-LSTM), our method exhibits better classification accuracy and computational efficiency. The experimental results demonstrate the effectiveness and practicality of our algorithm. Compared with methods of similar accuracy, it reduces training time and improves inference speed, and thus achieves a balance between speed and accuracy.




References

  1. Ahmed F, Paul PP, Gavrilova ML (2016) Joint-triplet motion image and local binary pattern for 3D action recognition using Kinect. In: Proceedings of the 29th International Conference on Computer Animation and Social Agents. ACM, pp 111–119
  2. Baradel F, Wolf C, Mille J (2017) Human action recognition: pose-based attention draws focus to hands. In: Proceedings of the IEEE International Conference on Computer Vision, pp 604–613
  3. Chen C, Liu K, Kehtarnavaz N (2016) Real-time human action recognition based on depth motion maps. J Real-Time Image Proc 12(1):155–163
  4. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
  5. De Mulder W, Bethard S, Moens MF (2015) A survey on the application of recurrent neural networks to statistical language modeling. Comput Speech Lang 30(1):61–98
  6. Di Gangi MA, Federico M (2018) Deep neural machine translation with weakly-recurrent unit. arXiv preprint arXiv:1805.04185
  7. Gal Y, Ghahramani Z (2016) A theoretically grounded application of dropout in recurrent neural networks. In: Advances in Neural Information Processing Systems, pp 1019–1027
  8. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
  9. Huang H, Wang H, Mak B (2019) Recurrent Poisson process unit for speech recognition. In: AAAI, pp 6538–6545
  10. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
  11. Kong Y, Fu Y (2018) Human action recognition and prediction: a survey. arXiv preprint arXiv:1806.11230
  12. Koniusz P, Cherian A, Porikli F (2016) Tensor representations via kernel linearization for action recognition from 3D skeletons. In: European Conference on Computer Vision. Springer, Cham, pp 37–53
  13. Lei T, Zhang Y, Wang SI, Dai H, Artzi Y (2018) Simple recurrent units for highly parallelizable recurrence. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp 4470–4481
  14. Liu J, Shahroudy A, Xu D, Wang G (2016) Spatio-temporal LSTM with trust gates for 3D human action recognition. In: European Conference on Computer Vision. Springer, Cham, pp 816–833
  15. Liu J, Wang G, Duan LY, Abdiyeva K, Kot AC (2017) Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Trans Image Process 27(4):1586–1599
  16. Mabrouk AB, Zagrouba E (2018) Abnormal behavior recognition for intelligent video surveillance systems: a review. Journal of Expert Systems with Applications 91(1):480–491
  17. Park J, Boo Y, Choi I, Shin S, Sung W (2018) Fully neural network based speech recognition on mobile and embedded devices. In: Advances in Neural Information Processing Systems, pp 10620–10630
  18. Park C, Lee C, Hong L, Hwang Y, Yoo T, Jang J, ... Kim HK (2019) S2-Net: machine reading comprehension with SRU-based self-matching networks. ETRI J 41(3):371–382
  19. Presti LL, La Cascia M (2016) 3D skeleton-based human action classification: a survey. Pattern Recogn 53(5):130–147
  20. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp 568–576
  21. Slama R, Wannous H, Daoudi M et al (2015) Accurate 3D action recognition using learning on the Grassmann manifold. Pattern Recogn 48(2):556–567
  22. Tamamori A, Hayashi T, Toda T, Takeda K (2018) Daily activity recognition based on recurrent neural network using multi-modal signals. APSIPA Transactions on Signal and Information Processing 7:E21
  23. Tang Y, Tian Y, Lu J, Li P, Zhou J (2018) Deep progressive reinforcement learning for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5323–5332
  24. Vemulapalli R, Arrate F, Chellappa R (2014) Human action recognition by representing 3D skeletons as points in a Lie group. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 588–595
  25. Wang D, Nyberg E (2015) A long short-term memory model for answer sentence selection in question answering. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing 2(4):707–712
  26. Xi R, Li M, Hou M, Fu M, Qu H, Liu D, Haruna CR (2018) Deep dilation on multimodality time series for human activity recognition. IEEE Access 6(1):53381–53396
  27. Xia L, Chen CC, Aggarwal JK (2012) View invariant human action recognition using histograms of 3D joints. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp 20–27
  28. Yang Z, Bu L, Wang T, Ouyang J, Yuan P (2018) Fire alarm for video surveillance based on convolutional neural network and SRU. In: 2018 5th International Conference on Information Science and Control Engineering (ICISCE), pp 232–236
  29. Yun K, Honorio J, Chattopadhyay D, Berg TL, Samaras D (2012) Two-person interaction detection using body-pose features and multiple instance learning. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp 28–35
  30. Zhang S, Liu X, Xiao J (2017) On geometric features for skeleton-based action recognition using multilayer LSTM networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp 148–157
  31. Zhang P, Xue J, Lan C, Zeng W, Gao Z, Zheng N (2018) Adding attentiveness to the neurons in recurrent neural networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 135–151
  32. Zheng Z, An G, Ruan Q (2017) Multi-level recurrent residual networks for action recognition. arXiv preprint arXiv:1711.08238
  33. Zhu Y, Chen W, Guo G (2013) Fusing spatiotemporal features and joints for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 486–491
  34. Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, Xie X (2016) Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: Thirtieth AAAI Conference on Artificial Intelligence



This work is supported by the National Natural Science Foundation of China under Grant No. 61871427. The authors would like to acknowledge the UT-Kinect Dataset and the SBU-Kinect Interaction Dataset, which were used to test the algorithms proposed in this study.

Author information

Correspondence to Qingshan She.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendix 1. Long short-term memory network (LSTM)

The LSTM network is a popular model for processing sequence data and performs well at modeling long-term temporal dependencies. Its computation is divided into as many steps as the sequence has elements, and the computation at each step can be formulated as follows:

$$ \begin{pmatrix} i_t \\ f_t \\ o_t \\ u_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} \left( M \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix} \right), \tag{18} $$
$$ c_t = i_t \odot u_t + f_t \odot c_{t-1}, \tag{19} $$
$$ h_t = o_t \odot \tanh\left(c_t\right), \tag{20} $$

where \( M \) is a model parameter matrix, \( x_t \) is the input at the \( t \)th step, \( h_{t-1} \) is the output state at the \( (t-1) \)th step, and \( \sigma \) denotes the sigmoid activation function. The detailed process is as follows. The input gate \( i_t \), forget gate \( f_t \), output gate \( o_t \) and input information \( u_t \) are first calculated according to Eq. (18). The elements of the three gates lie between 0 and 1, while the elements of \( u_t \), which passes through tanh, lie between −1 and 1. Then the internal state \( c_t \) is updated according to Eq. (19), where \( \odot \) denotes the element-wise product, \( i_t \) determines the weight of \( u_t \), and \( f_t \) determines the weight of the previous internal state \( c_{t-1} \). Finally, the output state \( h_t \) is obtained by Eq. (20).
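The three equations above can be traced in a few lines of code. The following is a minimal sketch in plain Python, not the authors' implementation: the function names `sigmoid`, `matvec` and `lstm_step`, the row ordering of \( M \) (input gate, forget gate, output gate, input information) and the weights in the example call are all assumptions made for illustration. No bias term is included because Eq. (18) does not show one.

```python
import math

def sigmoid(z):
    # Numerically stable form of 1 / (1 + exp(-z)).
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lstm_step(M, x_t, h_prev, c_prev):
    """One LSTM step following Eqs. (18)-(20).

    M has 4*d rows and len(x_t) + d columns, where d = len(h_prev).
    The row-block order (i, f, o, u) is an assumption of this sketch.
    """
    d = len(h_prev)
    z = matvec(M, x_t + h_prev)                    # M applied to stacked [x_t; h_{t-1}]
    i_t = [sigmoid(v) for v in z[0:d]]             # input gate, in (0, 1)
    f_t = [sigmoid(v) for v in z[d:2 * d]]         # forget gate, in (0, 1)
    o_t = [sigmoid(v) for v in z[2 * d:3 * d]]     # output gate, in (0, 1)
    u_t = [math.tanh(v) for v in z[3 * d:4 * d]]   # input information, in (-1, 1)
    # Eq. (19): blend new information with the previous internal state.
    c_t = [i * u + f * c for i, u, f, c in zip(i_t, u_t, f_t, c_prev)]
    # Eq. (20): gated output.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Illustrative call with made-up weights: hidden size d = 2, input size 3.
M = [[0.1 * ((r + c) % 3 - 1) for c in range(5)] for r in range(8)]
h, c = lstm_step(M, [1.0, 0.5, -0.5], [0.0, 0.0], [0.0, 0.0])
```

Because `lstm_step` needs `h_prev` and `c_prev`, the matrix product at step \( t \) cannot start before step \( t-1 \) has finished.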

Appendix 2. Spatio-temporal long short-term memory network (ST-LSTM)

The ST-LSTM is a spatial extension of LSTM that captures the dependence of joints in the temporal and spatial domains simultaneously. The ST-LSTM transition equations are formulated as follows:

$$ \begin{pmatrix} i_{j,t} \\ f_{j,t}^S \\ f_{j,t}^T \\ o_{j,t} \\ u_{j,t} \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} \left( M \begin{pmatrix} x_{j,t} \\ h_{j-1,t} \\ h_{j,t-1} \end{pmatrix} \right), \tag{21} $$
$$ c_{j,t} = i_{j,t} \odot u_{j,t} + f_{j,t}^S \odot c_{j-1,t} + f_{j,t}^T \odot c_{j,t-1}, \tag{22} $$
$$ h_{j,t} = o_{j,t} \odot \tanh\left(c_{j,t}\right), \tag{23} $$

where \( x_{j,t} \) is the input at the spatio-temporal step \( (j, t) \), which corresponds to the \( j \)th joint in the \( t \)th frame. Eq. (21) extends \( f_t \) in Eq. (18) to \( f_{j,t}^S \) and \( f_{j,t}^T \), the forget gates in the spatial and temporal domains, respectively. Similarly, in Eq. (22), the computation of the internal state \( c_{j,t} \) depends on \( c_{j-1,t} \) at the previous spatial step and \( c_{j,t-1} \) at the previous temporal step.
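To make the two-directional recurrence concrete, the sketch below applies Eqs. (21)-(23) over a small grid of joints and frames. It is an illustration under stated assumptions rather than the authors' code: the names `st_lstm_step` and `st_lstm_grid` are hypothetical, joints are visited in simple index order, states outside the grid (for \( j = 0 \) or \( t = 0 \)) are taken to be zero, and the row-block order of \( M \) follows the order of the gates in Eq. (21).

```python
import math

def sigmoid(z):
    # Numerically stable form of 1 / (1 + exp(-z)).
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def st_lstm_step(M, x, h_sp, h_tm, c_sp, c_tm):
    """One ST-LSTM step at (j, t) following Eqs. (21)-(23).

    h_sp, c_sp: states of the previous joint in the same frame (j-1, t).
    h_tm, c_tm: states of the same joint in the previous frame (j, t-1).
    M has 5*d rows and len(x) + 2*d columns, where d = len(h_sp).
    """
    d = len(h_sp)
    z = matvec(M, x + h_sp + h_tm)                 # stacked [x; h_{j-1,t}; h_{j,t-1}]
    i = [sigmoid(v) for v in z[0:d]]               # input gate
    f_s = [sigmoid(v) for v in z[d:2 * d]]         # spatial forget gate
    f_t = [sigmoid(v) for v in z[2 * d:3 * d]]     # temporal forget gate
    o = [sigmoid(v) for v in z[3 * d:4 * d]]       # output gate
    u = [math.tanh(v) for v in z[4 * d:5 * d]]     # input information
    # Eq. (22): one forget gate per direction.
    c = [a * b + fs * cs + ft * ct
         for a, b, fs, cs, ft, ct in zip(i, u, f_s, c_sp, f_t, c_tm)]
    # Eq. (23): gated output.
    h = [g * math.tanh(v) for g, v in zip(o, c)]
    return h, c

def st_lstm_grid(M, xs, d):
    """Run the step over all frames t and joints j; xs[t][j] is a feature vector.

    Zero boundary states and index-order joint traversal are assumptions.
    """
    T, J = len(xs), len(xs[0])
    zero = [0.0] * d
    h = [[None] * J for _ in range(T)]
    c = [[None] * J for _ in range(T)]
    for t in range(T):
        for j in range(J):
            h_sp, c_sp = (h[t][j - 1], c[t][j - 1]) if j > 0 else (zero, zero)
            h_tm, c_tm = (h[t - 1][j], c[t - 1][j]) if t > 0 else (zero, zero)
            h[t][j], c[t][j] = st_lstm_step(M, xs[t][j], h_sp, h_tm, c_sp, c_tm)
    return h, c

# Illustrative call with made-up data: 2 frames, 3 joints, 2 features, d = 1.
M = [[0.2, -0.1, 0.3, 0.1] for _ in range(5)]
xs = [[[0.1 * (t + j), -0.2 * j] for j in range(3)] for t in range(2)]
h, c = st_lstm_grid(M, xs, d=1)
```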

Appendix 3. Simple recurrent unit (SRU)

The SRU is a recurrent network like LSTM and GRU, but the majority of computation for each step is independent of the recurrence and can be easily parallelized [13]. The transition equations are formulated as below:

$$ {\tilde{x}}_t=W{x}_t, $$
$$ {f}_t=\mathrm{sigmoid}\left({W}_f{x}_t+{b}_f\right), $$
$$ {r}_t=\mathrm{sigmoid}\left({W}_r{x}_t+{b}_r\right), $$
$$ {c}_t={f}_t\odot {c}_{t-1}+\left(1-{f}_t\right)\odot {\tilde{x}}_t, $$
$$ {h}_t={r}_t\odot \tanh \left({c}_t\right)+\left(1-{r}_t\right)\odot {x}_t, $$

where \( x_t \) denotes the input at time \( t \), \( \tilde{x}_t \) is a linear transformation of \( x_t \), \( f_t \) is the forget gate, \( r_t \) is the reset gate, \( c_t \) denotes the internal state and \( h_t \) denotes the output state. Unlike in LSTM, the computation of \( \tilde{x}_t \), \( f_t \) and \( r_t \) in SRU does not depend on the internal state \( c_{t-1} \) at the previous time step, so these computations can be parallelized across time steps, which improves the inference speed.

The difference between the computation processes of LSTM and SRU is shown in Fig. 6. For LSTM, the symbols in each rectangle represent the variables that need to be computed at each time step t = 1, 2, ..., n. The computation of \( h_t \) cannot start until the previous time step has completed, and this sequential dependency limits the inference speed of LSTM. For SRU, the computations of \( \tilde{x}_t \), \( f_t \) and \( r_t \) depend only on the input \( x_t \) at each time step, so they are independent of the recurrence, as shown in the dotted-line box. The recurrent computations of \( c_t \) and \( h_t \) involve only element-wise operations, which are relatively lightweight and therefore fast. Parallelizing the majority of the computation for each step can significantly improve the inference speed of SRU.
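The split between recurrence-independent and recurrent computation can be made explicit in code. The sketch below, in plain Python, follows the SRU transition equations of this appendix; the function name `sru_forward` and the example weights are assumptions for illustration, and the loops in phase 1 run sequentially here although each iteration is independent and could be batched or parallelized. Note that the term \( (1 - r_t) \odot x_t \) requires the input and the state to have the same dimension.

```python
import math

def sigmoid(z):
    # Numerically stable form of 1 / (1 + exp(-z)).
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sru_forward(W, W_f, b_f, W_r, b_r, xs, c0):
    """Run an SRU layer over the sequence xs (a list of input vectors).

    All matrices are d x d and all vectors have length d, because the
    highway term (1 - r_t) * x_t adds the input directly to the output.
    """
    # Phase 1: depends only on the inputs, so every time step is independent
    # of the others (this is the part that can be parallelized).
    x_tilde = [matvec(W, x) for x in xs]
    f = [[sigmoid(a + b) for a, b in zip(matvec(W_f, x), b_f)] for x in xs]
    r = [[sigmoid(a + b) for a, b in zip(matvec(W_r, x), b_r)] for x in xs]

    # Phase 2: the recurrence itself, element-wise operations only.
    c = list(c0)
    hs = []
    for t, x in enumerate(xs):
        c = [ft * ck + (1.0 - ft) * xt for ft, ck, xt in zip(f[t], c, x_tilde[t])]
        h = [rt * math.tanh(ck) + (1.0 - rt) * xk for rt, ck, xk in zip(r[t], c, x)]
        hs.append(h)
    return hs, c

# Illustrative call with made-up weights: d = 2, sequence length 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
hs, c_last = sru_forward(identity, zeros, [0.0, 0.0], zeros, [0.0, 0.0],
                         [[1.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
```

In this sketch no matrix product appears inside the phase 2 loop, whereas an LSTM step must multiply \( M \) by a vector containing \( h_{t-1} \) at every step.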

Fig. 6

Illustration of the computation process of LSTM and SRU. a Recurrent computation of LSTM, b Recurrent computation of SRU


Cite this article

She, Q., Mu, G., Gan, H. et al. Spatio-temporal SRU with global context-aware attention for 3D human action recognition. Multimed Tools Appl (2020) doi:10.1007/s11042-019-08587-w



  • 3D action recognition
  • Recurrent neural networks
  • Simple recurrent unit
  • Spatio-temporal analysis
  • Attention mechanism