Combining CNN streams of dynamic image and depth data for action recognition

Singh, Roshan; Khurana, Rajat; Kushwaha, Alok Kumar Singh; Srivastava, Rajeev

doi:10.1007/s00530-019-00645-5

Regular Paper
Published: 14 January 2020

Volume 26, pages 313–322, (2020)

Multimedia Systems

Roshan Singh ORCID: orcid.org/0000-0002-8527-1162¹,
Rajat Khurana²,
Alok Kumar Singh Kushwaha² &
…
Rajeev Srivastava¹

1095 Accesses
12 Citations

Abstract

RGB-D sensors are in great demand because they produce large amounts of multimodal data, such as RGB images and depth maps, which are useful for training deep learning models. This paper proposes a deep learning model that recognizes human activities in a video sequence by combining multiple CNN streams. The proposed work uses dynamic images generated from RGB images, together with depth maps for three different dimensions. The model is trained on these four streams using VGG Net for action recognition. It is then evaluated and compared with other state-of-the-art methods from the literature on three challenging datasets, namely MSR Daily Activity, UTD-MHAD and CAD-60, in terms of accuracy, error, recall, specificity, precision and F-score. The results show that the proposed method outperforms the other methods.
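The six evaluation measures named in the abstract can all be derived from a multi-class confusion matrix. The sketch below is illustrative only: it assumes a one-vs-rest reading of each action class, and the function name `per_class_metrics` and the toy confusion matrix are hypothetical, not taken from the paper. The paper may aggregate the measures differently (for example, by averaging over classes).

```python
from typing import Dict, List


def per_class_metrics(confusion: List[List[int]], cls: int) -> Dict[str, float]:
    """One-vs-rest metrics for class `cls`.

    confusion[i][j] counts samples whose true class is i and
    whose predicted class is j (an assumed convention).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)

    tp = confusion[cls][cls]                                # true positives
    fn = sum(confusion[cls]) - tp                           # missed samples of cls
    fp = sum(confusion[i][cls] for i in range(n)) - tp      # other classes predicted as cls
    tn = total - tp - fn - fp                               # everything else

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of dividing by zero for empty classes.
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, total)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    f_score = ratio(2 * precision * recall, precision + recall)

    return {
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


# Toy 3-class confusion matrix (made-up numbers, for illustration only).
toy_confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
metrics = per_class_metrics(toy_confusion, cls=0)
```

For the toy matrix, class 0 has 8 true positives, 2 false negatives, 2 false positives and 18 true negatives out of 30 samples, so precision and recall are both 8/10 and specificity is 18/20.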

Author information

Authors and Affiliations

Department of Computer Science and Engineering, IIT (BHU), Varanasi, India
Roshan Singh & Rajeev Srivastava
Department of Computer Science and Engineering, IKG Punjab Technical University, Punjab, India
Rajat Khurana & Alok Kumar Singh Kushwaha

Corresponding author

Correspondence to Roshan Singh.

Additional information

Communicated by F. Wu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Singh, R., Khurana, R., Kushwaha, A.K.S. et al. Combining CNN streams of dynamic image and depth data for action recognition. Multimedia Systems 26, 313–322 (2020). https://doi.org/10.1007/s00530-019-00645-5

Received: 21 May 2019
Accepted: 30 December 2019
Published: 14 January 2020
Issue Date: June 2020
DOI: https://doi.org/10.1007/s00530-019-00645-5
