Combining CNN streams of dynamic image and depth data for action recognition

  • Regular Paper
  • Published in Multimedia Systems

Abstract

RGB-D sensors have been in great demand due to their capability of producing large amounts of multimodal data, such as RGB images and depth maps, which are useful for training deep learning models. In this paper, a deep learning model for recognizing human activities in a video sequence by combining multiple CNN streams is proposed. The proposed approach uses dynamic images generated from the RGB frames and from the depth maps along three different dimensions. These four streams are used to train a VGG network for action recognition. The model is evaluated and compared with other state-of-the-art methods from the literature on three challenging datasets, namely MSR Daily Activity, UTD-MHAD and CAD-60, in terms of accuracy, error, recall, specificity, precision and F-score. The obtained results show that the proposed method outperforms the other methods.
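For illustration, the dynamic images mentioned in the abstract can be computed by approximate rank pooling, the construction introduced by Bilen et al., which collapses a clip into one image by a weighted sum of its frames. The sketch below is an assumption about the preprocessing step, not the authors' code; the function name and the final rescaling to an 8-bit image (so the result can be fed to VGG like an ordinary input) are choices made here:

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a clip of shape (T, H, W) or (T, H, W, C) into a single
    dynamic image via approximate rank pooling (Bilen et al.)."""
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    # Harmonic numbers H_0, H_1, ..., H_T (H_0 = 0)
    H = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    t = np.arange(1, T + 1)
    # Closed-form rank-pooling coefficients: alpha_t = 2(T-t+1) - (T+1)(H_T - H_{t-1})
    alpha = 2 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])
    # Weighted sum over the time axis
    di = np.tensordot(alpha, frames, axes=1)
    # Rescale to [0, 255] so the result can be used as a regular image
    di = 255 * (di - di.min()) / (np.ptp(di) + 1e-8)
    return di.astype(np.uint8)
```

One such image is computed per stream (one from the RGB frames, one per depth projection), and the resulting four images are what the four VGG streams consume.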



Author information

Correspondence to Roshan Singh.

Communicated by F. Wu.


Cite this article

Singh, R., Khurana, R., Kushwaha, A.K.S. et al. Combining CNN streams of dynamic image and depth data for action recognition. Multimedia Systems 26, 313–322 (2020). https://doi.org/10.1007/s00530-019-00645-5

