
Sparse Modeling of Human Actions from Motion Imagery

Abstract

An efficient sparse modeling pipeline for the classification of human actions from video is developed here. Spatio-temporal features that characterize local changes in the image are first extracted. This is followed by learning a class-structured dictionary encoding the individual actions of interest. Classification is then based on reconstruction: the label assigned to each video comes from the optimal sparse linear combination of the learned basis vectors (action primitives) representing the actions. A deep-layer model with low computational cost, which learns the inter-class correlations of the data, is added to increase discriminative power. In spite of its simplicity and low computational cost, the method outperforms previously reported results on virtually all standard datasets.
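The reconstruction-based classification step described above admits a compact illustration. The following Python sketch is a minimal, hypothetical version of the idea rather than the paper's exact formulation: it assumes the per-class sub-dictionaries have already been learned offline (e.g., with K-SVD or online dictionary learning), solves the sparse coding step with a plain ISTA loop for the lasso, and uses illustrative dimensions and parameters throughout.

```python
import numpy as np

def sparse_code_ista(D, x, lam=0.1, n_iter=100):
    """Lasso sparse coding, min_a 0.5*||x - D a||^2 + lam*||a||_1, via ISTA."""
    L = np.linalg.norm(D, 2) ** 2                 # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - x) / L             # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return a

def classify_by_reconstruction(dictionaries, x):
    """Label x by the class whose sub-dictionary reconstructs it best."""
    errors = [np.linalg.norm(x - D @ sparse_code_ista(D, x)) for D in dictionaries]
    return int(np.argmin(errors))

# Toy usage: three classes, random unit-norm atoms standing in for learned
# action primitives, and a test vector built from class 1's atoms.
rng = np.random.default_rng(0)
dicts = [rng.standard_normal((64, 32)) for _ in range(3)]
dicts = [D / np.linalg.norm(D, axis=0) for D in dicts]
code = np.zeros(32)
code[rng.choice(32, size=3, replace=False)] = 1.0  # 3 active atoms
x = dicts[1] @ code
print("predicted class:", classify_by_reconstruction(dicts, x))  # expect 1
```

Roughly speaking, in the paper the per-class sub-dictionaries form one class-structured dictionary learned from labeled data, with a second (deep) layer modeling inter-class correlations; the argmin-of-residuals rule above corresponds only to the final decision step.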



Notes

  1. Dense sampling is not an interest point detector per se; it extracts spatio-temporal multi-scale patches indiscriminately throughout the video, at all locations (a toy extraction loop is sketched after this list).

  2. In this work, only a single scale is used to better illustrate the model’s advantages, already achieving state-of-the-art results. A multi-scale approach could certainly be beneficial.

  3. In this work, as is commonly done in the literature, we assume each video has already been segmented into time segments of uniform (single) actions. Considering that we will learn and detect actions based on just a handful of frames, this is not a very restrictive assumption. We will comment more on this later in the paper.

  4. http://www.nada.kth.se/cvap/actions/.

  5. http://cvrc.ece.utexas.edu/SDHA2010/Aerial_View_Activity.html.

  6. http://server.cs.ucf.edu/~vision/data.html#UCFSportsActionDataset.

  7. http://www.cs.ucf.edu/~liujg/YouTube_Action_dataset.html.
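To make the dense sampling of note 1 concrete, the toy sketch below extracts spatio-temporal patches on a regular grid over space and time, with no interest-point detection, at a single scale as in note 2. The array layout, patch size, and stride are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def dense_patches(video, patch=(7, 7, 5), stride=(4, 4, 3)):
    """video: (H, W, T) grayscale volume -> (n_patches, prod(patch)) matrix."""
    H, W, T = video.shape
    ph, pw, pt = patch
    sh, sw, st = stride
    rows = []
    for t in range(0, T - pt + 1, st):        # every temporal position
        for y in range(0, H - ph + 1, sh):    # every spatial position
            for x in range(0, W - pw + 1, sw):
                rows.append(video[y:y+ph, x:x+pw, t:t+pt].ravel())
    return np.asarray(rows)

clip = np.random.rand(60, 80, 30)             # stand-in for a short video clip
X = dense_patches(clip)
print(X.shape)                                 # one flattened patch per row
```

A multi-scale variant would simply repeat the same loop over several spatial and temporal resolutions of the clip.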


Acknowledgements

Work partially supported by NGA, ONR, ARO, NSF, Level Sets Systems, and AFOSR (NSSEFF). The authors would like to thank Pablo Sprechmann, Dr. Mariano Tepper, and David S. Hermina for very helpful suggestions and insightful discussions. We also thank Dr. Julien Mairal for providing the publicly available sparse modeling code (SPAMS, http://www.di.ens.fr/willow/SPAMS/downloads.html) used in this work.

Author information

Correspondence to Alexey Castrodad.

Additional information

Alexey Castrodad is also with NGA.


About this article

Cite this article

Castrodad, A., Sapiro, G. Sparse Modeling of Human Actions from Motion Imagery. Int J Comput Vis 100, 1–15 (2012). https://doi.org/10.1007/s11263-012-0534-7


Keywords

  • Action classification
  • Sparse modeling
  • Dictionary learning
  • Supervised learning