Abstract
We introduce a novel spatio-temporal feature learning approach for action recognition. First, we automatically detect and track the actor and map the action track to a cuboid. We then split the cuboid into block sequences and represent each block sequence as a data vector by concatenating the shape features of its blocks. For each action category, a two-layer network learns the distribution of the data vectors. The first layer consists of multiple Restricted Boltzmann Machines (RBMs), each trained on the data vectors that share the same spatial location. The output of the second-layer RBM is the learned spatio-temporal feature. Finally, we train a Support Vector Machine classifier for each class to recognize the actions. Experiments on challenging datasets confirm the effectiveness of our approach.
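The two-layer architecture described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: it trains binary RBMs with one-step contrastive divergence (CD-1), one first-layer RBM per spatial block location, then feeds the concatenated hidden activations into a second-layer RBM whose hidden probabilities serve as the learned feature. All layer sizes, learning rates, and the random toy data are placeholder assumptions.

```python
import numpy as np

class RBM:
    """Minimal binary RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr
        self.rng = rng

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(self, v):
        return self._sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return self._sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0):
        # Positive phase: hidden activations driven by the data.
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one Gibbs step back to a reconstruction.
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        # CD-1 approximation to the log-likelihood gradient.
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)
        return np.mean((v0 - v1) ** 2)   # reconstruction error

# Toy stand-in for block-sequence data vectors: 4 spatial locations,
# 64 samples each, 32-dimensional binary vectors (placeholder sizes).
rng = np.random.default_rng(1)
blocks = [rng.integers(0, 2, size=(64, 32)).astype(float) for _ in range(4)]

# First layer: one RBM per spatial location.
layer1 = [RBM(32, 16, seed=i) for i in range(4)]
for rbm, data in zip(layer1, blocks):
    for _ in range(20):
        rbm.cd1_step(data)

# Second layer: an RBM over the concatenated first-layer activations.
concat = np.hstack([rbm.hidden_probs(d) for rbm, d in zip(layer1, blocks)])
layer2 = RBM(concat.shape[1], 24, seed=9)
for _ in range(20):
    layer2.cd1_step(concat)

feature = layer2.hidden_probs(concat)   # learned spatio-temporal features
```

In the paper's pipeline, these per-category features would then be passed to a per-class Support Vector Machine; here the feature matrix is simply the second-layer hidden probabilities, one row per sample.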
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (61375038).
Cite this article
Pei, L., Ye, M., Zhao, X. et al. Learning spatio-temporal features for action recognition from the side of the video. SIViP 10, 199–206 (2016). https://doi.org/10.1007/s11760-014-0726-4