Abstract
An activity unfolds over many seconds, which makes it an inherently spatiotemporal structure. Many contemporary techniques learn activity representations from such structures with convolutional neural networks in order to recognize activities in videos. However, these representations fail to capture the complete activity because they use only a few video frames for learning. In this work we use raw depth sequences, which record the geometric information of objects, and apply the proposed enlarged-time-dimension convolution to learn features. Owing to these properties, depth sequences are more discriminative and less sensitive to lighting changes than RGB video. Because we operate on raw depth data, preprocessing time is also saved. Three-dimensional space-time filters are applied over the enlarged temporal dimension for feature learning. Experimental results demonstrate that lengthening the temporal resolution of raw depth data significantly improves activity recognition accuracy. We also study the impact of different spatial resolutions and find that accuracy stabilizes at larger spatial sizes. We report state-of-the-art results on three depth-based human activity recognition datasets: NTU RGB+D, MSRAction3D and MSRDailyActivity3D.
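To make the idea of a 3D space-time convolution over an enlarged temporal dimension concrete, the sketch below applies a single 3D convolution to a raw depth clip. It is a minimal illustration assuming PyTorch and made-up tensor sizes (32 frames, 64x64 depth maps), not the authors' actual network.

```python
# Minimal sketch (not the authors' code): a 3D convolution over a raw depth
# sequence whose temporal dimension has been enlarged, assuming PyTorch.
import torch
import torch.nn as nn

# A batch of raw depth sequences: (batch, channels=1, frames, height, width).
# The temporal dimension is "enlarged" to 32 frames rather than a short clip.
depth_clip = torch.randn(2, 1, 32, 64, 64)

# One space-time (3D) filter bank: each kernel spans 3 frames and a 3x3
# spatial neighbourhood, so features are learned jointly over space and time.
conv3d = nn.Conv3d(in_channels=1, out_channels=64,
                   kernel_size=(3, 3, 3), stride=1, padding=1)

features = conv3d(depth_clip)
print(features.shape)  # torch.Size([2, 64, 32, 64, 64])
```

With padding of 1 and stride 1, the enlarged temporal dimension (32 frames) is preserved through the layer, so subsequent 3D layers can continue to aggregate motion information over the full sequence.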
Cite this article
Singh, R., Dhillon, J.K., Kushwaha, A.K.S. et al. Depth based enlarged temporal dimension of 3D deep convolutional network for activity recognition. Multimed Tools Appl 78, 30599–30614 (2019). https://doi.org/10.1007/s11042-018-6425-3