Multimedia Tools and Applications, Volume 78, Issue 21, pp 30599–30614

Depth based enlarged temporal dimension of 3D deep convolutional network for activity recognition

  • Roshan Singh
  • Jagwinder Kaur Dhillon
  • Alok Kumar Singh Kushwaha
  • Rajeev Srivastava


An activity takes many seconds to complete, which makes it a spatiotemporal structure. Many contemporary techniques have tried to learn activity representations from such structures using convolutional neural networks in order to recognize activities in videos. Nevertheless, these representations failed to capture the complete activity because they were learned from very few video frames. In this work, we use raw depth sequences, considering their ability to record the geometric information of objects, and apply the proposed enlarged-time-dimension convolution to learn features. Due to these properties, depth sequences are more discriminative than RGB video and insensitive to lighting changes. Because we use raw depth data, preprocessing time is also saved. Three-dimensional space-time filters are applied over the increased time dimension for feature learning. Experimental results demonstrate that lengthening the temporal resolution over raw depth data significantly improves activity recognition accuracy. We also studied the impact of different spatial resolutions and conclude that accuracy stabilizes at larger spatial sizes. We show state-of-the-art results on three human activity recognition depth datasets: NTU RGB+D, MSRAction3D and MSRDailyActivity3D.
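To make the idea of a space-time filter over a longer clip concrete, the following is a minimal sketch and not the authors' network. It applies a single, hand-written 3D filter to a synthetic depth clip using "valid" cross-correlation (the operation that convolutional network libraries usually call convolution). The names `conv3d_valid` and `make_clip`, the clip lengths (4 and 16 frames), and the temporal-difference kernel are all assumptions chosen for illustration; the paper's actual layer sizes and clip lengths are not stated in the abstract.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [time][row][col]


def conv3d_valid(clip: Volume, kernel: Volume) -> Volume:
    """Naive 'valid' 3D cross-correlation of a depth clip with one space-time filter."""
    t_len, height, width = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out: Volume = []
    for t in range(t_len - kt + 1):
        plane = []
        for y in range(height - kh + 1):
            row = []
            for x in range(width - kw + 1):
                acc = 0.0
                # Sum over the filter's temporal and spatial extent.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += kernel[dt][dy][dx] * clip[t + dt][y + dy][x + dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def make_clip(num_frames: int, size: int = 4) -> Volume:
    """Synthetic depth clip (hypothetical data): every pixel's depth grows by 1.0 per frame."""
    return [[[float(t) for _ in range(size)] for _ in range(size)]
            for t in range(num_frames)]


# Illustrative temporal-difference filter: 3 frames deep, 1x1 in space (not learned).
kernel: Volume = [[[-1.0]], [[0.0]], [[1.0]]]

short_out = conv3d_valid(make_clip(4), kernel)   # short clip: few temporal positions
long_out = conv3d_valid(make_clip(16), kernel)   # longer clip: many temporal positions

print("temporal positions (4 frames): ", len(short_out))
print("temporal positions (16 frames):", len(long_out))
```

With a filter that spans 3 frames, a clip of T frames yields T - 3 + 1 temporal output positions, so a longer input clip gives the filter more of the activity to respond to. This is only the arithmetic behind the "enlarged temporal dimension"; a real model would stack many learned filters with nonlinearities and pooling.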


Keywords: Activity recognition; Depth sequences; Spatiotemporal convolutions




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, IIT (BHU), Varanasi, India
  2. Department of Computer Science and Engineering, IKGPTU, Kapurthala, India
