A four-stream ConvNet based on spatial and depth flow for human action classification using RGB-D data

Abstract

Appearance- and depth-based action recognition has been studied extensively to improve recognition accuracy by exploiting the motion and shape cues recoverable from RGB-D video data. Convolutional neural networks (CNNs) have shown superior performance on action classification problems when given spatial and apparent-motion inputs. The current generation of CNNs uses spatial RGB videos and depth maps to recognize action classes from RGB-D video. In this work, we propose a four-stream CNN architecture with two spatial streams (RGB frames and depth maps) and two apparent-motion streams, whose inputs are extracted from the optical flow of the RGB and depth videos. Each CNN stream comprises 8 convolutional layers, 2 dense layers, and one SoftMax layer; a score fusion model then merges the scores from the four streams. The performance of the proposed four-stream action recognition framework is evaluated on our own action dataset and on three benchmark action recognition datasets, and is compared against state-of-the-art CNN architectures for action recognition.
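
To make the pipeline concrete, below is a minimal PyTorch sketch of how the four streams and the score fusion could be wired together. This is an illustrative assumption, not the authors' implementation: the abstract fixes only the per-stream layout (8 convolutional layers, 2 dense layers, and a SoftMax layer) and the presence of a score fusion model, so the layer widths, kernel sizes, pooling schedule, input resolution, and the averaging fusion rule below are all guesses.

import torch
import torch.nn as nn

class Stream(nn.Module):
    # One CNN stream: 8 convolutional layers, 2 dense layers, SoftMax scores.
    def __init__(self, in_channels, num_classes):
        super().__init__()
        layers, c = [], in_channels
        # 8 conv layers; the channel widths and pair-wise pooling are assumed.
        for i, out_c in enumerate([64, 64, 128, 128, 256, 256, 512, 512]):
            layers += [nn.Conv2d(c, out_c, 3, padding=1), nn.ReLU(inplace=True)]
            if i % 2 == 1:
                layers.append(nn.MaxPool2d(2))  # halve resolution after each conv pair
            c = out_c
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)  # assumed, to fix the dense-layer input size
        self.classifier = nn.Sequential(     # the 2 dense layers producing class scores
            nn.Linear(512, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes))

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return torch.softmax(self.classifier(x), dim=1)  # per-stream SoftMax scores

class FourStreamNet(nn.Module):
    # Two spatial streams (RGB, depth) and two apparent-motion streams
    # (optical flow of RGB, optical flow of depth) with late score fusion.
    def __init__(self, num_classes=10):
        super().__init__()
        self.rgb        = Stream(3, num_classes)  # spatial RGB frames
        self.depth      = Stream(1, num_classes)  # spatial depth maps
        self.rgb_flow   = Stream(2, num_classes)  # x/y flow of RGB frames
        self.depth_flow = Stream(2, num_classes)  # x/y flow of depth maps

    def forward(self, rgb, depth, rgb_flow, depth_flow):
        scores = [self.rgb(rgb), self.depth(depth),
                  self.rgb_flow(rgb_flow), self.depth_flow(depth_flow)]
        # Score fusion: average the four SoftMax score vectors (assumed rule;
        # the paper's fusion model may weight the streams differently).
        return torch.stack(scores).mean(dim=0)

if __name__ == "__main__":
    net = FourStreamNet(num_classes=10)
    b = 2  # a batch of two frames at an assumed 112x112 input resolution
    fused = net(torch.randn(b, 3, 112, 112), torch.randn(b, 1, 112, 112),
                torch.randn(b, 2, 112, 112), torch.randn(b, 2, 112, 112))
    print(fused.shape)  # torch.Size([2, 10])

The two apparent-motion inputs could be produced by any dense optical flow estimator run on consecutive RGB frames and consecutive depth maps, for example OpenCV's cv2.calcOpticalFlowFarneback, which returns horizontal and vertical flow fields that stack into the 2-channel tensors assumed above.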



Author information

Corresponding author

Correspondence to P. V. V. Kishore.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Srihari, D., Kishore, P.V.V., Kumar, E.K. et al. A four-stream ConvNet based on spatial and depth flow for human action classification using RGB-D data. Multimed Tools Appl 79, 11723–11746 (2020). https://doi.org/10.1007/s11042-019-08588-9
