Abstract
This paper addresses human action recognition in video sequences. A method based on optical flow estimation is presented, in which the critical points of the flow field are extracted. Multi-scale trajectories are generated from these points and characterized in the frequency domain. A sequence is then described by fusing this frequency information with motion orientation and shape information. The method has been tested on video datasets, achieving recognition rates among the highest in the state of the art. In contrast to recent dense sampling strategies, the proposed method requires only the critical points of the motion flow field, permitting a lower computational cost and a better sequence description. A cross-dataset generalization experiment illustrates the robustness of the method to recognition dataset biases. Results, comparisons and prospects on complex action recognition datasets are finally discussed.
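To make the first two stages of the pipeline concrete, the sketch below is a minimal toy illustration, not the authors' actual implementation: it locates the critical points of a synthetic flow field (here simply the pixels where the flow magnitude vanishes) and builds a frequency-domain descriptor of a point trajectory from the magnitudes of its low-order Fourier coefficients. The function names, the vanishing-magnitude criterion, and the threshold `eps` are all hypothetical choices for illustration.

```python
import numpy as np

def critical_points(u, v, eps=1e-3):
    """Return (row, col) indices where the flow magnitude vanishes.
    A simple stand-in for critical-point extraction on a flow field."""
    mag = np.sqrt(u**2 + v**2)
    return np.argwhere(mag < eps)

def frequency_descriptor(trajectory, n_coeffs=8):
    """Describe a 2D trajectory (T x 2 array) by the magnitudes of its
    first Fourier coefficients, one spectrum per coordinate axis."""
    spectrum = np.abs(np.fft.rfft(trajectory, axis=0))
    return spectrum[:n_coeffs].ravel()

# Toy flow field: a single vortex whose centre is a critical point.
h, w = 32, 32
y, x = np.mgrid[0:h, 0:w].astype(float)
u = -(y - h / 2)            # rotational flow around (16, 16)
v = x - w / 2
pts = critical_points(u, v)  # -> [[16, 16]]

# Toy trajectory: a point oscillating horizontally with period 16.
t = np.arange(64)
traj = np.stack([np.cos(2 * np.pi * t / 16),
                 np.zeros_like(t, dtype=float)], axis=1)
desc = frequency_descriptor(traj)  # energy concentrated at bin 4 of the x-axis
```

In this toy setup the vortex centre is recovered as the only critical point, and the oscillation shows up as a single dominant Fourier coefficient, which is the kind of frequency signature the descriptor is meant to capture.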
Cite this article
Beaudry, C., Péteri, R. & Mascarilla, L. An efficient and sparse approach for large scale human action recognition in videos. Machine Vision and Applications 27, 529–543 (2016). https://doi.org/10.1007/s00138-016-0760-z