An efficient sparse modeling pipeline for the classification of human actions from video is developed here. Spatio-temporal features that characterize local changes in the image are first extracted. A class-structured dictionary encoding the individual actions of interest is then learned. Classification is based on reconstruction: the label assigned to each video comes from the optimal sparse linear combination of the learned basis vectors (action primitives) representing the actions. A deep-layer model with low computational cost, which learns the inter-class correlations of the data, is added to increase discriminative power. In spite of its simplicity and low computational cost, the method outperforms previously reported results for virtually all standard datasets.
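The reconstruction-based labeling described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one already-learned dictionary per class, codes a feature vector against each with a plain ISTA solver for the lasso objective, and picks the class with the lowest cost. The names `ista` and `classify`, the penalty `lam`, and the toy dictionaries are all hypothetical, and the deep-layer stage is omitted.

```python
import numpy as np

def ista(D, x, lam=0.1, n_iter=200):
    """Approximately solve min_a 0.5*||x - D a||^2 + lam*||a||_1 (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - x) / L      # gradient step on the quadratic term
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

def classify(x, dictionaries, lam=0.1):
    """Return the index of the class dictionary giving the lowest lasso cost."""
    costs = []
    for D in dictionaries:
        a = ista(D, x, lam)
        costs.append(0.5 * np.sum((x - D @ a) ** 2) + lam * np.sum(np.abs(a)))
    return int(np.argmin(costs))

# Toy data (assumption): two classes whose atoms live in disjoint coordinates.
rng = np.random.default_rng(0)
D0 = np.zeros((20, 8)); D0[:10] = rng.standard_normal((10, 8))
D1 = np.zeros((20, 8)); D1[10:] = rng.standard_normal((10, 8))
D0 /= np.linalg.norm(D0, axis=0)           # unit-norm atoms
D1 /= np.linalg.norm(D1, axis=0)

x = 2.0 * D0[:, 0] + 1.5 * D0[:, 3]        # sparse combination of class-0 atoms
label = classify(x, [D0, D1])
```

In practice the dictionaries would be learned from training features of each action, and a dedicated solver would replace the hand-written ISTA loop.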
Dense sampling is not an interest point detector per se. It extracts spatio-temporal multi-scale patches indiscriminately throughout the video at all locations.
In this work, only a single scale is used to better illustrate the model’s advantages, already achieving state-of-the-art results. A multi-scale approach could certainly be beneficial.
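As a rough illustration of single-scale dense sampling, the sketch below cuts a grayscale video into spatio-temporal patches on a regular grid, with no interest point detection. The function name `dense_patches`, the patch size, the stride, and the use of raw intensities are assumptions for the example, not values taken from the paper.

```python
import numpy as np

def dense_patches(video, size=(5, 8, 8), stride=(5, 8, 8)):
    """Extract flattened spatio-temporal patches on a regular grid.

    video  : array of shape (T, H, W), grayscale frames
    size   : (frames, height, width) of each patch (hypothetical values)
    stride : grid step along time, rows, and columns (hypothetical values)
    Returns an array with one flattened patch per row.
    """
    T, H, W = video.shape
    pt, ph, pw = size
    st, sh, sw = stride
    patches = [
        video[t:t + pt, y:y + ph, x:x + pw].ravel()
        for t in range(0, T - pt + 1, st)
        for y in range(0, H - ph + 1, sh)
        for x in range(0, W - pw + 1, sw)
    ]
    return np.stack(patches)

# Toy video (assumption): 10 frames of 16x16 pixels.
video = np.arange(10 * 16 * 16, dtype=float).reshape(10, 16, 16)
patches = dense_patches(video)
```

A smaller stride than the patch size gives overlapping patches; a multi-scale variant would repeat the extraction for several patch sizes.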
In this work, as commonly done in the literature, we assume each video has been already segmented into time segments of uniform (single) actions. Considering we will learn and detect actions based on just a handful of frames, this is not a very restrictive assumption. We will comment more on this later in the paper.
Work partially supported by NGA, ONR, ARO, NSF, Level Sets Systems, and AFOSR (NSSEFF). The authors would like to thank Pablo Sprechmann, Dr. Mariano Tepper, and David S. Hermina for very helpful suggestions and insightful discussions. We also thank Dr. Julien Mairal for providing publicly available sparse modeling code (SPAMS http://www.di.ens.fr/willow/SPAMS/downloads.html) used in this work.
Alexey Castrodad is also with NGA.
Cite this article
Castrodad, A., Sapiro, G. Sparse Modeling of Human Actions from Motion Imagery. Int J Comput Vis 100, 1–15 (2012). https://doi.org/10.1007/s11263-012-0534-7
- Action classification
- Sparse modeling
- Dictionary learning
- Supervised learning