Abstract
Appearance- and depth-based action recognition has been researched extensively to improve recognition accuracy by exploiting motion and shape cues recovered from RGB-D video data. Convolutional neural networks (CNNs) have shown evidence of superiority on action classification problems with spatial and apparent-motion inputs. The current generation of CNNs uses spatial RGB videos and depth maps to recognize action classes from RGB-D video. In this work, we propose a four-stream CNN architecture with two spatial streams, fed by RGB and depth video data, and two apparent-motion streams, fed by inputs extracted from the optical flow of the RGB and depth videos. Each CNN stream comprises eight convolutional layers, two dense layers, and one SoftMax layer, and a score-fusion model merges the scores from the four streams. The performance of the proposed four-stream action recognition framework is tested on our own action dataset and on three benchmark action recognition datasets, and the proposed model is evaluated against state-of-the-art CNN architectures for action recognition.
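The late score-fusion step described above can be sketched as follows: each of the four streams (RGB spatial, depth spatial, RGB flow, depth flow) produces a SoftMax class-score vector, and the fused prediction is taken from a weighted average of the four vectors. This is a minimal illustrative sketch, not the paper's implementation; the stream names, equal default weights, and function names are assumptions.

```python
import math

def softmax(logits):
    """Convert one stream's raw class scores to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_scores(stream_scores, weights=None):
    """Weighted average of the per-stream class-score vectors (late fusion)."""
    n_streams = len(stream_scores)
    n_classes = len(stream_scores[0])
    if weights is None:
        # Assumption: equal weighting of the four streams.
        weights = [1.0 / n_streams] * n_streams
    fused = [0.0] * n_classes
    for w, scores in zip(weights, stream_scores):
        for c in range(n_classes):
            fused[c] += w * scores[c]
    return fused

def predict(stream_logits):
    """Apply SoftMax per stream, fuse, and return the arg-max class index."""
    probs = [softmax(l) for l in stream_logits]
    fused = fuse_scores(probs)
    return max(range(len(fused)), key=fused.__getitem__)
```

For example, with four streams scoring three action classes, `predict` returns the index of the class whose averaged probability is highest, so a class favored by three of the four streams wins even if one stream disagrees.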
Srihari, D., Kishore, P.V.V., Kumar, E.K. et al. A four-stream ConvNet based on spatial and depth flow for human action classification using RGB-D data. Multimed Tools Appl 79, 11723–11746 (2020). https://doi.org/10.1007/s11042-019-08588-9