Sequential Deep Learning for Action Recognition with Synthetic Multi-view Data from Depth Maps

  • Bin LiangEmail author
  • Lihong Zheng
  • Xinying Li
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 996)


Recurrent neural network (RNN) has proven successful recently in action recognition. However, depth sequences are of high dimensionality and contain rich human dynamics, which makes traditional RNNs difficult to capture complex action information. This paper addresses the problem of human action recognition from sequences of depth maps using sequential deep learning. The proposed method first synthesizes multi-view depth sequences by rotating 3D point clouds from depth maps. Each depth sequence is then split into short-term temporal segments. For each segment, a multi-view depth motion template (MVDMT), which compresses the segment to a motion template, is constructed for short-term multi-view action representation. The MVDMT effectively characterizes the multi-view appearance and motion patterns within a short-term duration. Convolutional Neural Network (CNN) models are leveraged to extract features from MVDMT, and a CNN-RNN network is subsequently employed to learn an effective representation for sequential patterns of the multi-view depth sequence. The proposed multi-view sequential deep learning framework can simultaneously capture spatial-temporal appearance and motion features in the depth sequence. The proposed method has been evaluated on the MSR Action3D and MSR Action Pairs datasets, achieving promising results compared with the state-of-the-art methods based on depth data.


Action recognition Sequential deep learning Depth map Multi-view data 


  1. 1.
    Jaimes, A., Sebe, N.: Multimodal human-computer interaction: a survey. Comput. Vis. Image Underst. 108(1), 116–134 (2007)CrossRefGoogle Scholar
  2. 2.
    Atrey, P.K., Hossain, M.A., El Saddik, A., Kankanhalli, M.S.: Multimodal fusion for multimedia analysis: a survey. Multimedia Syst. 16(6), 345–379 (2010)CrossRefGoogle Scholar
  3. 3.
    Yang, X., Zhang, C., Tian, Y.L.: Recognizing actions using depth motion maps-based histograms of oriented gradients. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 1057–1060. ACM (2012)Google Scholar
  4. 4.
    Chen, C., Liu, K., Kehtarnavaz, N.: Real-time human action recognition based on depth motion maps. J. Real Time Image Proc. 12(1), 153–163 (2016). Scholar
  5. 5.
    Liang, B., Zheng, L.: 3D motion trail model based pyramid histograms of oriented gradient for action recognition. In: 2014 22nd International Conference on Pattern Recognition (ICPR), pp. 1952–1957. IEEE (2014)Google Scholar
  6. 6.
    Jetley, S., Cuzzolin, F.: 3D activity recognition using motion history and binary shape templates. In: Jawahar, C.V., Shan, S. (eds.) ACCV 2014, Part I. LNCS, vol. 9008, pp. 129–144. Springer, Cham (2015). Scholar
  7. 7.
    Liang, B., Zheng, L.: Specificity and latent correlation learning for action recognition using synthetic multi-view data from depth maps. IEEE Trans. Image Process. 26(12), 5560–5574 (2017)MathSciNetCrossRefGoogle Scholar
  8. 8.
    Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Li, F.-F.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pp. 248–255. IEEE (2009)Google Scholar
  9. 9.
    Liang, B., Zheng, L.: Spatio-temporal pyramid cuboid matching for action recognition using depth maps. In: 2015 IEEE International Conference on Image Processing (ICIP), pp. 2070–2074. IEEE (2015)Google Scholar
  10. 10.
    Ji, S., Wei, X., Yang, M., Kai, Y.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)CrossRefGoogle Scholar
  11. 11.
    Karpathy, A., George, T., Shetty, S., Leung, T., Sukthankar, R., Li, F.-F.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)Google Scholar
  12. 12.
    Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)CrossRefGoogle Scholar
  13. 13.
    Wu, Z., Wang, X., Jiang, Y.-G., Hao, Y., Xue, X.: Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 461–470. ACM (2015)Google Scholar
  14. 14.
    Smisek, J., Jancosek, M., Pajdla, T.: 3D with kinect. In: Fossati, A., Gall, J., Grabner, H., Ren, X., Konolige, K. (eds.) Consumer Depth Cameras for Computer Vision. Advances in Computer Vision and Pattern Recognition, pp. 3–25. Springer, London (2013). Scholar
  15. 15.
    Abidi, B.R., Zheng, Y., Gribok, A.V., Abidi, M.A.: Improving weapon detection in single energy X-ray images through pseudocoloring. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 36(6), 784–796 (2006)CrossRefGoogle Scholar
  16. 16.
    Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)Google Scholar
  17. 17.
    Graves, A., Mohamed, A.-R., Hinton, G.: Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Speech and Signal Processing (ICASSP), pp. 6645–6649. IEEE (2013)Google Scholar
  18. 18.
    Xu, J., Yao, T., Zhang, Y., Mei, T.: Learning multimodal attention LSTM networks for video captioning. In: Proceedings of the 2017 ACM on Multimedia Conference, pp. 537–545. ACM (2017)Google Scholar
  19. 19.
    Shi, Y., Tian, Y., Wang, Y., Huang, T.: Sequential deep trajectory descriptor for action recognition with three-stream CNN. IEEE Trans. Multimedia 19, 1510–1520 (2017)CrossRefGoogle Scholar
  20. 20.
    Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 9–14. IEEE (2010)Google Scholar
  21. 21.
    Oreifej, O., Liu, Z.: HON4D: histogram of oriented 4D normals for activity recognition from depth sequences. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 716–723. IEEE (2013)Google Scholar
  22. 22.
    Chollet, F.: Keras (2015).
  23. 23.
    Vieira, A.W., Nascimento, E.R., Oliveira, G.L., Liu, Z., Campos, M.F.M.: STOP: space-time occupancy patterns for 3D action recognition from depth map sequences. In: Alvarez, L., Mejail, M., Gomez, L., Jacobo, J. (eds.) CIARP 2012. LNCS, vol. 7441, pp. 252–259. Springer, Heidelberg (2012). Scholar
  24. 24.
    Wang, J., Liu, Z., Chorowski, J., Chen, Z., Wu, Y.: Robust 3D action recognition with random occupancy patterns. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part II. LNCS, pp. 872–885. Springer, Heidelberg (2012). Scholar
  25. 25.
    Xia, L., Aggarwal, J.K.: Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2834–2841. IEEE (2013)Google Scholar
  26. 26.
    Lu, C., Jia, J., Tang, C.-K.: Range-sample depth feature for action recognition. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 772–779. IEEE (2014)Google Scholar

Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. 1.University of Technology SydneySydneyAustralia
  2. 2.Charles Sturt UniversityWagga WaggaAustralia
  3. 3.Changchun University of TechnologyChangchunChina

Personalised recommendations