Signal, Image and Video Processing, Volume 12, Issue 6, pp. 1197–1205

Combining 2D and 3D deep models for action recognition with depth information

  • Ali Seydi Keçeli
  • Aydın Kaya
  • Ahmet Burak Can

Original Paper

Abstract

The use of depth data in activity recognition is a rapidly growing research area. This paper presents a method for recognizing single-person activities and dyadic interactions using deep features extracted from both 3D and 2D representations constructed from depth sequences. First, a 3D volume representation is generated from the spatiotemporal information in the depth frames of an action sequence, and a 3D-CNN is trained to learn features from these volumes. In addition, a 2D representation is constructed from a weighted sum of the depth frames and fed to a pre-trained CNN model. After a feature selection step, the features learned by the two models are combined to train the final classifier. Among the classifiers evaluated, an SVM-based model produced the best results. The proposed method was tested on the MSR-Action3D dataset for single-person activities, the SBU dataset for dyadic interactions, and the NTU RGB+D dataset for both types of actions. Experimental results show that the proposed 3D and 2D representations and the deep features extracted from them are robust and efficient, and the method achieves results comparable to state-of-the-art methods in the literature.
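
As a rough illustration of the two representations described above, the following Python sketch stacks a depth sequence into a fixed-size volume for the 3D-CNN and collapses it into a single weighted-sum image for the pre-trained 2D CNN. It assumes depth frames arrive as a list of 2D NumPy arrays; the linearly increasing weights (emphasising later frames, as in a motion-history image) are an illustrative choice, not necessarily the authors' exact weighting scheme.

    import numpy as np

    def depth_volume(frames, depth_bins=32):
        # Resample the sequence along time so every action maps to the
        # same volume shape before being fed to the 3D-CNN.
        frames = np.asarray(frames, dtype=np.float32)            # (T, H, W)
        t_idx = np.linspace(0, len(frames) - 1, depth_bins).round().astype(int)
        return frames[t_idx]                                     # (depth_bins, H, W)

    def weighted_depth_image(frames):
        # Collapse the sequence into one 2D image by a weighted sum;
        # the linear weights below are illustrative only.
        frames = np.asarray(frames, dtype=np.float32)            # (T, H, W)
        w = np.arange(1, len(frames) + 1, dtype=np.float32)
        w /= w.sum()                                             # normalise weights
        return np.tensordot(w, frames, axes=1)                   # (H, W)

    # Example with a synthetic 40-frame, 240x320 depth sequence
    seq = [np.random.rand(240, 320) for _ in range(40)]
    vol = depth_volume(seq)            # (32, 240, 320) input for the 3D-CNN
    img = weighted_depth_image(seq)    # (240, 320) input for the 2D CNN

Resampling to a fixed number of temporal bins keeps the 3D-CNN input shape constant across actions of different lengths, which is the usual prerequisite for a fixed-architecture volumetric network.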

Keywords

Action recognition · Dyadic actions · Deep learning · Feature selection · RGB-D data

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  • Ali Seydi Keçeli (1)
  • Aydın Kaya (1)
  • Ahmet Burak Can (1)

  1. Department of Computer Engineering, Faculty of Engineering, Hacettepe University, Ankara, Turkey
