A Deep Learning Approach for Real-Time 3D Human Action Recognition from Skeletal Data

  • Huy Hieu Pham
  • Houssam Salmane
  • Louahdi Khoudour
  • Alain Crouzil
  • Pablo Zegers
  • Sergio A. Velastin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11662)


We present a new deep learning approach for real-time 3D human action recognition from skeletal data and apply it to develop a vision-based intelligent surveillance system. Given a skeleton sequence, we encode the skeleton poses and their motions into a single RGB image. An Adaptive Histogram Equalization (AHE) algorithm is then applied to the color images to enhance their local patterns and generate more discriminative features. For learning and classification, we design deep neural networks based on the Densely Connected Convolutional Network (DenseNet) architecture to extract features from the enhanced color images and classify them into action classes. Experimental results on two challenging datasets show that the proposed method reaches state-of-the-art accuracy while requiring low computational time for training and inference. This paper also introduces CEMEST, a new RGB-D dataset depicting passenger behaviors in public transport. It consists of 203 untrimmed real-world surveillance videos of realistic normal and anomalous events. With the support of data augmentation and transfer learning, we achieve promising results under the real-world conditions of this dataset. This enables the construction of real-world applications based on deep learning for enhancing monitoring and security in public transport.
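The encoding and enhancement steps described in the abstract can be illustrated with a minimal sketch. This is not the paper's Enhanced-SPMF representation: it assumes the simplest possible encoding, in which raw (x, y, z) joint coordinates are rescaled to [0, 255] and used directly as (R, G, B), and it stands in for AHE with a tile-wise histogram equalization that has no interpolation between tiles. The function names, the tile size, and the synthetic input are all hypothetical.

```python
import numpy as np


def skeleton_to_rgb(seq):
    """Map a (frames, joints, 3) coordinate array to a uint8 RGB image.

    Assumed layout: rows index joints, columns index frames,
    and (x, y, z) become (R, G, B).
    """
    seq = np.asarray(seq, dtype=np.float64)
    lo, hi = seq.min(), seq.max()
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    img = np.rint((seq - lo) * scale).astype(np.uint8)
    return img.transpose(1, 0, 2)  # (joints, frames, 3)


def equalize_tile(tile):
    """Histogram-equalize one single-channel uint8 tile."""
    hist = np.bincount(tile.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    denom = cdf[-1] - cdf_min
    if denom == 0:  # constant tile: nothing to enhance
        return tile.copy()
    lut = np.rint((cdf - cdf_min) / denom * 255.0).clip(0, 255).astype(np.uint8)
    return lut[tile]


def tiled_equalization(img, tile=8):
    """Equalize every tile x tile block of every channel independently.

    Simplified stand-in for AHE: no interpolation between tiles.
    """
    out = np.empty_like(img)
    h, w, channels = img.shape
    for c in range(channels):
        for y in range(0, h, tile):
            for x in range(0, w, tile):
                block = img[y:y + tile, x:x + tile, c]
                out[y:y + tile, x:x + tile, c] = equalize_tile(block)
    return out


# Synthetic input (assumption): 40 frames, 20 joints, 3D coordinates.
rng = np.random.default_rng(0)
sequence = rng.normal(size=(40, 20, 3))
image = skeleton_to_rgb(sequence)
enhanced = tiled_equalization(image)
```

The resulting `enhanced` array has the same shape as `image` and could be resized and fed to an image classifier such as a DenseNet. A full AHE implementation would also blend neighboring tile mappings to avoid visible block boundaries.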


Action recognition, Skeletal data, Enhanced-SPMF, DenseNet



This research was supported by Cerema, France. Sergio A. Velastin is grateful for funding from the Universidad Carlos III de Madrid, the EU’s 7th Framework Programme for Research, Technological Development and Demonstration (grant 600371), Ministerio de Economia, Industria y Competitividad (COFUND2013-51509), Ministerio de Educación, Cultura y Deporte (CEI-15-17) and Banco Santander.

Supplementary material

Supplementary material 1 (PDF, 2293 KB)



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Cerema, Equipe-projet STI, Toulouse, France
  2. Université Toulouse III - Paul Sabatier, Institut de Recherche en Informatique de Toulouse, Cedex 9, Toulouse, France
  3. Aparnix, Santiago, Chile
  4. Cortexica Vision Systems Ltd., London, UK
  5. Queen Mary University of London and Department of Computer Science, University Carlos III of Madrid, Madrid, Spain
