Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition

  • Jun Liu
  • Amir Shahroudy
  • Dong Xu
  • Gang Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9907)


3D action recognition – the analysis of human actions based on 3D skeleton data – has recently become popular due to the succinctness, robustness, and view-invariance of the skeletal representation. Recent work on this problem has proposed RNN-based learning methods to model contextual dependencies in the temporal domain. In this paper, we extend this idea to the spatio-temporal domain to analyze the hidden sources of action-related information within the input data over both domains concurrently. Inspired by the graphical structure of the human skeleton, we further propose a more powerful tree-structure based traversal method. To handle noise and occlusion in 3D skeleton data, we introduce a new gating mechanism within the LSTM to learn the reliability of the sequential input data and accordingly adjust its effect on updating the long-term context information stored in the memory cell. Our method achieves state-of-the-art performance on four challenging benchmark datasets for 3D human action analysis.
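The reliability gating described above can be illustrated with a minimal single-cell sketch: a standard LSTM step augmented with an element-wise trust term that down-weights the input's contribution to the memory cell when the input disagrees with a prediction derived from the previous context. This is a simplified illustration, not the paper's exact ST-LSTM equations; the weight names (`Wi`, `Wx`, `Wh`, ...), the packing of parameters in a dict, and the squared-error mismatch function are all assumptions made for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_weights(n_in, n_hid, rng):
    """Random weights for the sketch (hypothetical layout, not the paper's)."""
    n_cat = n_in + n_hid
    W = {}
    for g in ("i", "f", "o", "u"):
        W["W" + g] = rng.standard_normal((n_hid, n_cat)) * 0.1
        W["b" + g] = np.zeros(n_hid)
    W["Wx"] = rng.standard_normal((n_hid, n_in)) * 0.1   # maps input into cell space
    W["bx"] = np.zeros(n_hid)
    W["Wh"] = rng.standard_normal((n_hid, n_hid)) * 0.1  # predicts input from context
    W["bh"] = np.zeros(n_hid)
    return W

def trust_gate_lstm_step(x, h_prev, c_prev, W, lam=0.5):
    """One LSTM step with a simplified trust gate.

    The previous hidden state predicts what the current input should look
    like; a large mismatch (e.g. a noisy or occluded joint position) yields
    low trust, which shrinks the input's effect on the memory cell.
    """
    z = np.concatenate([x, h_prev])
    i = sigmoid(W["Wi"] @ z + W["bi"])   # input gate
    f = sigmoid(W["Wf"] @ z + W["bf"])   # forget gate
    o = sigmoid(W["Wo"] @ z + W["bo"])   # output gate
    u = np.tanh(W["Wu"] @ z + W["bu"])   # candidate cell update

    x_map = np.tanh(W["Wx"] @ x + W["bx"])       # input mapped into cell space
    x_hat = np.tanh(W["Wh"] @ h_prev + W["bh"])  # context-based prediction of it
    tau = np.exp(-lam * (x_hat - x_map) ** 2)    # element-wise trust in (0, 1]

    # Trusted inputs write to the cell; untrusted ones fall back on old memory.
    c = tau * (i * u) + (1.0 - tau) * (f * c_prev)
    h = o * np.tanh(c)
    return h, c, tau

# Usage: one step over a 3-D joint coordinate with a 4-unit cell.
rng = np.random.default_rng(0)
W = init_weights(3, 4, rng)
h, c, tau = trust_gate_lstm_step(np.ones(3), np.zeros(4), np.zeros(4), W)
```

Note that when `tau` is near 1 the update reduces to an (almost) ordinary LSTM write, and when it is near 0 the cell is dominated by the forget-gated previous memory, which is the intended fallback for unreliable skeleton measurements.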


3D action recognition · Recurrent neural networks · Long short-term memory · Trust gate · Spatio-temporal analysis



The research is supported by Singapore Ministry of Education (MOE) Tier 2 ARC28/14, and Singapore A*STAR Science and Engineering Research Council PSF1321202099. This research was carried out at the Rapid-Rich Object Search (ROSE) Lab at Nanyang Technological University. The ROSE Lab is supported by the National Research Foundation, Singapore, under its Interactive Digital Media (IDM) Strategic Research Programme. We would also like to thank NVIDIA for the GPU donation.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore
  2. School of Electrical and Information Engineering, University of Sydney, Sydney, Australia
