Abstract
RGB-D sensors are in great demand because they can produce large amounts of multimodal data, such as RGB images and depth maps, which are useful for training deep learning models. In this paper, a deep learning model that recognizes human activities in a video sequence by combining multiple CNN streams is proposed. The proposed work uses dynamic images generated from RGB images, together with depth maps for three different dimensions. The model is trained on the resulting four streams using VGG Net for action recognition. It is then evaluated and compared with other state-of-the-art methods from the literature on three challenging datasets, namely MSR Daily Activity, UTD-MHAD and CAD-60, in terms of accuracy, error, recall, specificity, precision and F-score. The results show that the proposed method outperforms the other methods.
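The evaluation measures named in the abstract can all be derived from a multi-class confusion matrix. The following is a minimal sketch of one common way to do so, not the authors' evaluation code. It assumes that per-class values are computed one-vs-rest and then macro-averaged, which the abstract does not specify. The function names and the example matrix are hypothetical and chosen only for illustration.

```python
def per_class_metrics(confusion):
    """Return one-vs-rest metrics for each class.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j (an assumed layout).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                        # true k, predicted other
        fp = sum(confusion[i][k] for i in range(n)) - tp   # other, predicted k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        results.append({
            "precision": precision,
            "recall": recall,
            "specificity": specificity,
            "f_score": f_score,
        })
    return results


def summarize(confusion):
    """Overall accuracy and error, plus macro-averaged per-class metrics."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    accuracy = correct / total
    per_class = per_class_metrics(confusion)
    names = ("precision", "recall", "specificity", "f_score")
    macro = {name: sum(m[name] for m in per_class) / len(per_class)
             for name in names}
    return {"accuracy": accuracy, "error": 1.0 - accuracy, **macro}


# Hypothetical 3-class confusion matrix, not data from the paper.
example_confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
example_summary = summarize(example_confusion)
```

Macro-averaging weights every action class equally, which is a reasonable default when the classes are roughly balanced. A weighted or micro average would be an equally valid choice if the paper's protocol calls for it.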
Additional information
Communicated by F. Wu.
About this article
Cite this article
Singh, R., Khurana, R., Kushwaha, A.K.S. et al. Combining CNN streams of dynamic image and depth data for action recognition. Multimedia Systems 26, 313–322 (2020). https://doi.org/10.1007/s00530-019-00645-5