Abstract
Appearance- and depth-based action recognition has been researched extensively to improve recognition accuracy by exploiting motion and shape cues recovered from RGB-D video data. Convolutional neural networks (CNNs) have shown evidence of superiority on action classification problems with spatial and apparent-motion inputs. The current generation of CNNs uses spatial RGB videos and depth maps to recognize action classes from RGB-D video. In this work, we propose a four-stream CNN architecture with two spatial streams, fed by RGB and depth video data, and two apparent-motion streams, fed by inputs extracted from the optical flow of the RGB and depth videos. Each CNN stream comprises eight convolutional layers, two dense layers, and one SoftMax layer, and a score-fusion model merges the scores from the four streams. The performance of the proposed four-stream action recognition framework is tested on our own action dataset and on three benchmark action recognition datasets, and the proposed model is evaluated against state-of-the-art CNN architectures for action recognition.
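The late score-fusion step described above can be sketched as follows: each of the four streams (RGB spatial, depth spatial, RGB flow, depth flow) produces a SoftMax class-score vector, and the fused prediction is taken from a weighted average of the four vectors. This is a minimal illustrative sketch, not the paper's implementation; the stream names, equal default weights, and function names are assumptions.

```python
import math

def softmax(logits):
    """Convert one stream's raw class scores to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_scores(stream_scores, weights=None):
    """Weighted average of the per-stream class-score vectors (late fusion)."""
    n_streams = len(stream_scores)
    n_classes = len(stream_scores[0])
    if weights is None:
        # Assumption: equal weighting of the four streams.
        weights = [1.0 / n_streams] * n_streams
    fused = [0.0] * n_classes
    for w, scores in zip(weights, stream_scores):
        for c in range(n_classes):
            fused[c] += w * scores[c]
    return fused

def predict(stream_logits):
    """Apply SoftMax per stream, fuse, and return the arg-max class index."""
    probs = [softmax(l) for l in stream_logits]
    fused = fuse_scores(probs)
    return max(range(len(fused)), key=fused.__getitem__)
```

For example, with four streams scoring three action classes, `predict` returns the index of the class whose averaged probability is highest, so a class favored by three of the four streams wins even if one stream disagrees.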
Srihari, D., Kishore, P.V.V., Kumar, E.K. et al. A four-stream ConvNet based on spatial and depth flow for human action classification using RGB-D data. Multimed Tools Appl 79, 11723–11746 (2020). https://doi.org/10.1007/s11042-019-08588-9