Abstract
An activity unfolds over many seconds, which makes it an inherently spatiotemporal structure. Many contemporary techniques learn activity representations from such structures with convolutional neural networks in order to recognize activities in videos. However, these representations fail to capture the complete activity because they use only a few video frames for learning. In this work we use raw depth sequences, which record the geometric information of objects, and apply the proposed enlarged-time-dimension convolution to learn features. Owing to these properties, depth sequences are more discriminative and less sensitive to lighting changes than RGB video. Because we operate on raw depth data, preprocessing time is also saved. Three-dimensional space-time filters are applied over the enlarged temporal dimension for feature learning. Experimental results demonstrate that lengthening the temporal resolution of raw depth data significantly improves activity recognition accuracy. We also study the impact of different spatial resolutions and find that accuracy stabilizes at larger spatial sizes. We report state-of-the-art results on three depth-based human activity recognition datasets: NTU RGB+D, MSRAction3D and MSRDailyActivity3D.
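To make the idea of a 3D space-time convolution over an enlarged temporal dimension concrete, the sketch below applies a single 3D convolution to a raw depth clip. It is a minimal illustration assuming PyTorch and made-up tensor sizes (32 frames, 64x64 depth maps), not the authors' actual network.

```python
# Minimal sketch (not the authors' code): a 3D convolution over a raw depth
# sequence whose temporal dimension has been enlarged, assuming PyTorch.
import torch
import torch.nn as nn

# A batch of raw depth sequences: (batch, channels=1, frames, height, width).
# The temporal dimension is "enlarged" to 32 frames rather than a short clip.
depth_clip = torch.randn(2, 1, 32, 64, 64)

# One space-time (3D) filter bank: each kernel spans 3 frames and a 3x3
# spatial neighbourhood, so features are learned jointly over space and time.
conv3d = nn.Conv3d(in_channels=1, out_channels=64,
                   kernel_size=(3, 3, 3), stride=1, padding=1)

features = conv3d(depth_clip)
print(features.shape)  # torch.Size([2, 64, 32, 64, 64])
```

With padding of 1 and stride 1, the enlarged temporal dimension (32 frames) is preserved through the layer, so subsequent 3D layers can continue to aggregate motion information over the full sequence.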
Cite this article
Singh, R., Dhillon, J.K., Kushwaha, A.K.S. et al. Depth based enlarged temporal dimension of 3D deep convolutional network for activity recognition. Multimed Tools Appl 78, 30599–30614 (2019). https://doi.org/10.1007/s11042-018-6425-3