
Depth based enlarged temporal dimension of 3D deep convolutional network for activity recognition


Abstract

An activity takes many seconds to complete, which makes it a spatiotemporal structure. Many contemporary techniques attempt to learn activity representations from such structures with convolutional neural networks in order to recognize activities in videos. Nevertheless, these representations fail to capture the complete activity because they use only a few video frames for learning. In this work we use raw depth sequences, given their ability to record the geometric information of objects, and apply the proposed enlarged-temporal-dimension convolution to learn features. Owing to these properties, depth sequences are more discriminative and less sensitive to lighting changes than RGB video. Because we use raw depth data, preprocessing time is also saved. Three-dimensional space-time filters are applied over the enlarged temporal dimension for feature learning. Experimental results demonstrate that lengthening the temporal resolution over raw depth data significantly improves activity recognition accuracy. We also study the impact of different spatial resolutions and conclude that accuracy stabilizes at larger spatial sizes. We report state-of-the-art results on three depth-based human activity recognition datasets: NTU RGB+D, MSRAction3D and MSRDailyActivity3D.
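To make the approach concrete, below is a minimal PyTorch-style sketch of 3D space-time convolution applied to a single-channel raw depth clip with an enlarged temporal dimension. It is an illustration only, not the authors' exact architecture: the clip length, spatial size, layer widths, and number of classes are assumptions.

```python
# Illustrative sketch only (assumed architecture, not the one from the paper):
# a 3D CNN whose kernels span (time, height, width) over a long raw-depth clip.
import torch
import torch.nn as nn

class Depth3DConvNet(nn.Module):
    def __init__(self, num_classes: int = 60):
        super().__init__()
        self.features = nn.Sequential(
            # Raw depth maps have one channel; kernels are (t, h, w).
            nn.Conv3d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space only, keep temporal resolution early
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(2, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),               # global space-time pooling
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, T, H, W) raw depth clip; a larger T is the "enlarged temporal dimension".
        return self.classifier(self.features(x).flatten(1))

# Example: a 32-frame depth clip at 112x112 spatial resolution (both values are assumptions).
model = Depth3DConvNet(num_classes=60)
clip = torch.randn(2, 1, 32, 112, 112)
logits = model(clip)     # shape (2, 60)
```

Training on longer clips (a larger T) is what enlarging the temporal dimension amounts to here; the spatial size can be varied independently to study its effect on accuracy.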



Author information

Corresponding author

Correspondence to Roshan Singh.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article
Cite this article

Singh, R., Dhillon, J.K., Kushwaha, A.K.S. et al. Depth based enlarged temporal dimension of 3D deep convolutional network for activity recognition. Multimed Tools Appl 78, 30599–30614 (2019). https://doi.org/10.1007/s11042-018-6425-3

