
Learning spatio-temporal features for action recognition from the side of the video

  • Original Paper
  • Published in Signal, Image and Video Processing

Abstract

We introduce a novel spatio-temporal feature learning approach for action recognition. First, we automatically detect and track the actor and map the action track to a cuboid, which we then split into block sequences. Each block sequence is represented as a data vector by concatenating its block shape features. For each action category, a two-layer network learns the distribution of these data vectors: the first layer consists of multiple Restricted Boltzmann Machines (RBMs), each trained on the data vectors from a single spatial location, and the output of the second-layer RBM is the learned spatio-temporal feature. Finally, we train a Support Vector Machine classifier for each class to recognize the actions. Experiments on challenging datasets confirm the effectiveness of our approach.
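The two-layer feature learning stage described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the RBMs are binary and trained with one-step contrastive divergence (CD-1), all dimensions (number of spatial locations, block feature size, hidden-unit counts) are placeholder assumptions, the "block shape features" are random stand-ins, and the final per-class SVM stage is omitted.

```python
import numpy as np

class RBM:
    """Minimal binary RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible bias
        self.c = np.zeros(n_hidden)    # hidden bias
        self.lr = lr

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(self, v):
        return self._sigmoid(v @ self.W + self.c)

    def visible_probs(self, h):
        return self._sigmoid(h @ self.W.T + self.b)

    def fit(self, data, epochs=50):
        for _ in range(epochs):
            # positive phase: hidden probabilities and a binary sample
            h0 = self.hidden_probs(data)
            h_sample = (self.rng.random(h0.shape) < h0).astype(float)
            # negative phase: one reconstruction step
            v1 = self.visible_probs(h_sample)
            h1 = self.hidden_probs(v1)
            n = data.shape[0]
            self.W += self.lr * (data.T @ h0 - v1.T @ h1) / n
            self.b += self.lr * (data - v1).mean(axis=0)
            self.c += self.lr * (h0 - h1).mean(axis=0)

# --- two-layer feature learning for one action category (toy sizes) ---
rng = np.random.default_rng(1)
n_locations, n_samples, block_dim, n_hid1, n_hid2 = 4, 32, 16, 8, 6

# one data matrix per spatial location (stand-in for block shape features)
location_data = [(rng.random((n_samples, block_dim)) > 0.5).astype(float)
                 for _ in range(n_locations)]

# layer 1: one RBM per spatial location
layer1 = []
for loc, X in enumerate(location_data):
    rbm = RBM(block_dim, n_hid1, seed=loc)
    rbm.fit(X)
    layer1.append(rbm)

# concatenate layer-1 hidden activations, then train the layer-2 RBM on them
h1 = np.hstack([rbm.hidden_probs(X) for rbm, X in zip(layer1, location_data)])
layer2 = RBM(n_locations * n_hid1, n_hid2, seed=99)
layer2.fit(h1)

# the layer-2 hidden activations are the learned spatio-temporal features
features = layer2.hidden_probs(h1)
print(features.shape)  # (32, 6)
```

In a full pipeline, `features` for each category would then be fed to a one-per-class SVM as in the abstract; here the classifier is left out to keep the sketch focused on the feature learning.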




Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (61375038).

Author information


Corresponding author

Correspondence to Mao Ye.


About this article


Cite this article

Pei, L., Ye, M., Zhao, X. et al. Learning spatio-temporal features for action recognition from the side of the video. SIViP 10, 199–206 (2016). https://doi.org/10.1007/s11760-014-0726-4

