Discriminative Feature Learning with Constraints of Category and Temporal for Action Recognition

  • Zhize Wu
  • Shouhong Wan
  • Peiquan Jin
  • Lihua Yue
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9218)

Abstract

With the availability of depth cameras, many recent studies of human action recognition have been conducted on depth sequences. Motivated by the observations that each pose has a relative location within a complete action sequence and that similar actions differ only in fine spatio-temporal details, we propose a novel method to recognize human actions from depth information. Representations of depth maps are learned and reconstructed using a stacked denoising autoencoder. With added category and temporal constraints, the learned features are more discriminative: they capture the subtle but significant differences between actions and mitigate the nuisance variability of temporal misalignment. A greedy layer-wise strategy is used to train the deep neural network. We then apply temporal pyramid matching to the learned features to generate a temporal representation. Finally, a linear SVM is trained to classify each sequence into an action class. We compare our method with previous methods on the MSR Action3D dataset; the results show that it significantly outperforms the traditional model and is comparable to state-of-the-art action recognition performance. Experimental results also indicate that our model can restore highly noisy input data.
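The temporal pyramid step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the pooling operator or the number of pyramid levels, so the max pooling, the default of three levels, and the function name `temporal_pyramid` are all assumptions made for illustration; this is not the authors' implementation. Per-frame feature vectors (for example, autoencoder outputs) are pooled over progressively finer temporal segments and concatenated into one fixed-length vector that a linear classifier could consume.

```python
from typing import List, Sequence


def temporal_pyramid(frames: Sequence[Sequence[float]], levels: int = 3) -> List[float]:
    """Pool per-frame features over a temporal pyramid (illustrative sketch).

    Level l splits the sequence into 2**l contiguous segments. Each segment
    is max-pooled per feature dimension (an assumed choice), and the pooled
    vectors of all segments and levels are concatenated.
    """
    if not frames:
        raise ValueError("need at least one frame")
    n, dim = len(frames), len(frames[0])
    pooled: List[float] = []
    for level in range(levels):
        segments = 2 ** level
        for s in range(segments):
            start = s * n // segments
            # Guarantee a non-empty segment even when n < segments.
            end = max((s + 1) * n // segments, start + 1)
            chunk = frames[start:end]
            pooled.extend(max(f[d] for f in chunk) for d in range(dim))
    return pooled


# Four frames with two features each, pooled over two pyramid levels.
features = temporal_pyramid([[1, 0], [2, 5], [3, 1], [0, 4]], levels=2)
print(features)  # [3, 5, 2, 5, 3, 4]
```

The output length is the feature dimension times `2**levels - 1`, independent of sequence length, which is what allows sequences of different durations to be compared by the same classifier.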

Keywords

  • Action recognition
  • Category
  • Temporal
  • Feature learning
  • Stacked denoising autoencoders

Notes

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61272317) and the General Program of Natural Science Foundation of Anhui of China (Grant No. 1208085MF90).

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Zhize Wu (1)
  • Shouhong Wan (1)
  • Peiquan Jin (1)
  • Lihua Yue (1)
  1. Key Laboratory of Electromagnetic Space Information, School of Computer Science and Technology, Chinese Academy of Sciences, University of Science and Technology of China, Hefei, China
