International Journal of Computer Vision, Volume 100, Issue 1, pp 1–15

Sparse Modeling of Human Actions from Motion Imagery



An efficient sparse modeling pipeline for the classification of human actions from video is developed. Spatio-temporal features that characterize local changes in the image are first extracted. A class-structured dictionary encoding the individual actions of interest is then learned. Classification is based on reconstruction: the label assigned to each video comes from the optimal sparse linear combination of the learned basis vectors (action primitives) representing the actions. A deep-layer model with low computational cost, which learns the inter-class correlations of the data, is added to increase discriminative power. Despite its simplicity and low computational cost, the method outperforms previously reported results for virtually all standard datasets.
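The reconstruction-based labeling step can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one small dictionary per class, uses random unit-norm atoms in place of learned action primitives, and solves the sparse coding problem with plain ISTA, which is chosen here only for brevity. The deep-layer stage is omitted, and the names `ista` and `classify` and the parameters `lam` and `n_iter` are hypothetical.

```python
import numpy as np

def ista(D, x, lam=0.1, n_iter=200):
    """Approximately solve min_a 0.5*||x - D a||^2 + lam*||a||_1 with ISTA."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - x))  # gradient step on the quadratic term
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return a

def classify(x, dictionaries, lam=0.1):
    """Return the index of the class dictionary with the lowest sparse coding cost."""
    costs = []
    for D in dictionaries:
        a = ista(D, x, lam)
        residual = x - D @ a
        costs.append(0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(a)))
    return int(np.argmin(costs))

# Toy setup (assumption): random unit-norm atoms stand in for learned dictionaries.
rng = np.random.default_rng(0)
dim, atoms, n_classes = 30, 8, 3
dictionaries = []
for _ in range(n_classes):
    D = rng.standard_normal((dim, atoms))
    D /= np.linalg.norm(D, axis=0)  # normalize each atom to unit length
    dictionaries.append(D)

# Synthetic feature vector built from two atoms of class 1.
coeffs = np.zeros(atoms)
coeffs[[1, 5]] = [1.5, -2.0]
x = dictionaries[1] @ coeffs

predicted = classify(x, dictionaries)
print(predicted)
```

In this toy setting the feature vector is an exact sparse combination of atoms from a single class, so that class's dictionary reconstructs it at a much lower cost than the others. With real video features, the dictionaries would be learned from training data for each action.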


Action classification, Sparse modeling, Dictionary learning, Supervised learning



Copyright information

© Springer Science+Business Media, LLC (outside the USA) 2012

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, USA
