Methods based on feature descriptors around local interest points are now widely used in action recognition. Feature points are detected using measures such as saliency, periodicity, and motion activity. Each of these measures is usually intensity-based and offers a trade-off between density and informativeness. In this paper, we address the problem of action recognition by representing image sequences as a sparse collection of patch-level space-time events that are salient in both the space and time domains. Our method uses a multi-scale volumetric representation of video and adaptively selects the space-time scale at which the saliency of a patch is most significant. The input image sequences are first partitioned into non-overlapping patches. Each patch is then represented by a vector of coefficients that linearly reconstructs the patch from a learned dictionary of basis patches. The space-time saliency of a patch is measured by Shannon's self-information, so that a patch's saliency is determined by the variation of information in the contents of its spatiotemporal neighborhood. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed method.
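The pipeline described in the abstract (non-overlapping patches, sparse codes over a dictionary, self-information saliency) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and several parts are assumptions made only for illustration: the dictionary is random rather than learned, the sparse codes come from plain ISTA, the coefficient probabilities are estimated with per-atom histograms over the whole clip rather than over a local spatiotemporal neighborhood, and the multi-scale selection step is omitted. All function names and parameter values are hypothetical.

```python
import numpy as np

def extract_patches(video, size=(4, 8, 8)):
    """Split a (T, H, W) array into non-overlapping space-time patches, one per row."""
    pt, ph, pw = size
    T, H, W = video.shape
    nt, nh, nw = T // pt, H // ph, W // pw
    v = video[:nt * pt, :nh * ph, :nw * pw]
    v = v.reshape(nt, pt, nh, ph, nw, pw).transpose(0, 2, 4, 1, 3, 5)
    return v.reshape(nt * nh * nw, pt * ph * pw)

def sparse_code(X, D, lam=0.1, n_iter=100):
    """ISTA for min_A 0.5*||X - A D||_F^2 + lam*||A||_1 (rows of D are atoms)."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the gradient
    A = np.zeros((X.shape[0], D.shape[0]))
    for _ in range(n_iter):
        grad = (A @ D - X) @ D.T                # gradient of the quadratic term
        Z = A - grad / L                        # gradient step
        A = np.sign(Z) * np.maximum(np.abs(Z) - lam / L, 0.0)  # soft threshold
    return A

def self_information(A, bins=16):
    """Per-patch saliency: sum over atoms of -log p(coefficient).

    Assumption: p is estimated from a histogram of each atom's coefficients
    over all patches in the clip, not over a local neighborhood.
    """
    n, k = A.shape
    sal = np.zeros(n)
    for j in range(k):
        counts, edges = np.histogram(A[:, j], bins=bins)
        p = (counts + 1.0) / (n + bins)         # Laplace smoothing avoids log(0)
        idx = np.clip(np.digitize(A[:, j], edges[1:-1]), 0, bins - 1)
        sal += -np.log(p[idx])
    return sal

# Synthetic clip: low-amplitude noise with one bright space-time block.
rng = np.random.default_rng(0)
video = rng.normal(0.0, 0.01, size=(8, 32, 32))
video[0:4, 8:16, 16:24] += 1.0                  # the "event" patch

X = extract_patches(video)                      # 32 patches of dimension 256
D = rng.normal(size=(64, X.shape[1]))           # random stand-in for a learned dictionary
D /= np.linalg.norm(D, axis=1, keepdims=True)   # unit-norm atoms

A = sparse_code(X, D)
saliency = self_information(A)
most_salient = int(np.argmax(saliency))
```

In this toy setting the background patches are too weak to activate any atom, so their codes are common and carry little information, while the block containing the event produces rare coefficients and a high saliency score. A learned dictionary (for example from K-SVD or online dictionary learning) would replace the random `D` in practice.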
This research was partly supported by the National 973 Program of China (2013CB329401) and the NSFC, China (Nos. 61273258, 61105001). The Shanghai Key Lab of Modern Optical System provided the experimental material.
Cite this article
Fu, Y., Zhang, T. & Wang, W. Sparse coding-based space-time video representation for action recognition. Multimed Tools Appl 76, 12645–12658 (2017). https://doi.org/10.1007/s11042-016-3630-9
Keywords
- Sparse coding
- Space-time saliency
- Action recognition
- Shannon entropy