
Sparse coding-based space-time video representation for action recognition



Methods based on feature descriptors around local interest points are now widely used in action recognition. Feature points are detected using measures such as saliency, periodicity, and motion activity. Each of these measures is usually intensity-based and trades off density against informativeness. In this paper, we address the problem of action recognition by representing image sequences as a sparse collection of patch-level space-time events that are salient in both the space and time domains. Our method uses a multi-scale volumetric representation of video and adaptively selects the space-time scale at which the saliency of a patch is most significant. The input image sequences are first partitioned into non-overlapping patches. Each patch is then represented by a vector of coefficients that linearly reconstructs the patch from a learned dictionary of basis patches. The space-time saliency of a patch is measured by Shannon's self-information, so that a patch's saliency is determined by the information variation in the contents of its spatiotemporal neighborhood. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed method.
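The pipeline described in the abstract (partition into non-overlapping patches, sparse coding against a dictionary, self-information saliency) can be illustrated with a small sketch. This is not the authors' implementation, and several choices are assumptions made for illustration only: the dictionary is a set of random unit-norm atoms rather than a learned one, the sparse codes come from plain matching pursuit, the probability of each coefficient is estimated with a histogram over all patches of the clip rather than over a local spatiotemporal neighborhood, and the multi-scale selection step is omitted. All function names (`partition`, `matching_pursuit`, `self_information`) are hypothetical.

```python
import math
import random


def partition(video, pt, ph, pw):
    """Split a T x H x W video (nested lists) into non-overlapping
    pt x ph x pw space-time patches, each flattened to a list."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    patches = []
    for t in range(0, T - pt + 1, pt):
        for y in range(0, H - ph + 1, ph):
            for x in range(0, W - pw + 1, pw):
                patches.append([video[t + dt][y + dy][x + dx]
                                for dt in range(pt)
                                for dy in range(ph)
                                for dx in range(pw)])
    return patches


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def matching_pursuit(patch, atoms, n_nonzero):
    """Greedy sparse code: repeatedly pick the unit-norm atom most
    correlated with the residual and subtract its contribution."""
    residual = list(patch)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        scores = [dot(residual, a) for a in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        coeffs[k] += scores[k]
        residual = [r - scores[k] * v for r, v in zip(residual, atoms[k])]
    return coeffs


def self_information(codes, n_bins=8):
    """Per-patch saliency: sum over atoms of -log p(coefficient), where p
    is a histogram estimate over all patches (assumes independent atoms)."""
    n, n_atoms = len(codes), len(codes[0])
    saliency = [0.0] * n
    for j in range(n_atoms):
        column = [c[j] for c in codes]
        lo, hi = min(column), max(column)
        width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant columns
        bins = [min(int((v - lo) / width), n_bins - 1) for v in column]
        counts = [0] * n_bins
        for b in bins:
            counts[b] += 1
        for i, b in enumerate(bins):
            saliency[i] += -math.log(counts[b] / n)
    return saliency


random.seed(0)

# Toy clip: 4 frames of 8x8 pixels, static except one active space-time block.
video = [[[0.0] * 8 for _ in range(8)] for _ in range(4)]
for t in range(2):
    for y in range(4):
        for x in range(4):
            video[t][y][x] = random.random()

patches = partition(video, pt=2, ph=4, pw=4)  # 8 patches of 32 values each

# Assumption: random unit-norm atoms stand in for a learned dictionary.
atoms = []
for _ in range(16):
    a = [random.gauss(0.0, 1.0) for _ in range(32)]
    norm = math.sqrt(dot(a, a))
    atoms.append([v / norm for v in a])

codes = [matching_pursuit(p, atoms, n_nonzero=3) for p in patches]
saliency = self_information(codes)
most_salient = max(range(len(saliency)), key=saliency.__getitem__)
print(most_salient, [round(s, 2) for s in saliency])
```

In this toy clip only the first patch contains any activity, so its coefficients are rare among the patches and it receives the highest self-information score, while the seven static patches share identical codes and identical low scores. In practice the random atoms would be replaced by a dictionary learned from training patches.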






This research was partly supported by the National 973 Program of China (2013CB329401) and the NSFC, China (Nos. 61273258, 61105001). The Shanghai Key Lab of Modern Optical System helped by providing the experimental material.

Author information

Correspondence to Yinghua Fu.


About this article


Cite this article

Fu, Y., Zhang, T. & Wang, W. Sparse coding-based space-time video representation for action recognition. Multimed Tools Appl 76, 12645–12658 (2017).



  • Sparse coding
  • Space-time saliency
  • Action recognition
  • Self-information
  • Shannon entropy