Cognitive Computation, Volume 3, Issue 1, pp 167–184

Spatiotemporal Features for Action Recognition and Salient Event Detection

  • Konstantinos Rapantzikos
  • Yannis Avrithis
  • Stefanos Kollias


Abstract

Although the mechanisms of human visual understanding remain only partially understood, computational models inspired by existing knowledge of human vision have emerged and been applied in several fields. In this paper, we propose a novel method for computing visual saliency from video sequences that takes into account the genuinely spatiotemporal nature of video. The visual input is represented as a volume in space-time and decomposed into a set of feature volumes at multiple resolutions. Feature competition, implemented as a constrained minimization, is used to produce a saliency distribution over the input; the proposed constraints are inspired by and associated with the Gestalt laws. The approach makes several contributions: it extends existing visual feature models to a volumetric representation, allows competition across features, scales, and voxels, and formulates constraints in accordance with perceptual principles. The resulting saliency volume is used to detect prominent spatiotemporal regions and is subsequently applied to action recognition and perceptually salient event detection in video sequences. Comparisons against established methods on public datasets reveal the potential of the proposed model. The experiments include three action recognition scenarios and salient temporal segment detection in a movie database annotated by humans.
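To make the volumetric pipeline concrete, the following is a minimal, illustrative sketch of a space-time saliency computation in the spirit of classical center-surround models: a video is treated as a (t, y, x) intensity volume, smoothed at several scales, and center-surround differences are summed into a saliency volume. This is a simplified stand-in for the paper's method; it does not implement the constrained minimization or the Gestalt-inspired feature competition, and all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3 sigma, normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def smooth(vol, sigma):
    """Separable Gaussian smoothing along every axis of the (t, y, x)
    volume, using edge padding so the output keeps the input shape."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    for axis in range(vol.ndim):
        padded = np.concatenate(
            [vol.take([0] * r, axis), vol, vol.take([-1] * r, axis)], axis)
        out = np.zeros_like(vol, dtype=float)
        for i, w in enumerate(k):
            sl = [slice(None)] * vol.ndim
            sl[axis] = slice(i, i + vol.shape[axis])
            out += w * padded[tuple(sl)]
        vol = out
    return vol

def saliency_volume(video, scales=(1.0, 2.0)):
    """Toy center-surround conspicuity over a space-time intensity volume:
    |center - surround| responses are summed across scales and the result
    is normalized to [0, 1]."""
    video = np.asarray(video, dtype=float)
    sal = np.zeros_like(video)
    for s in scales:
        center = smooth(video, s)
        surround = smooth(video, 2.0 * s)  # coarser surround estimate
        sal += np.abs(center - surround)
    rng = sal.max() - sal.min()
    return (sal - sal.min()) / rng if rng > 0 else sal
```

A brief bright event in an otherwise static sequence yields high saliency around the event in both space and time, which is the kind of spatiotemporal prominence the full model detects and then feeds into action recognition and event detection.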


Keywords: Spatiotemporal visual saliency · Volumetric representation · Action recognition · Salient event detection



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Konstantinos Rapantzikos 1 (corresponding author)
  • Yannis Avrithis 1
  • Stefanos Kollias 1

  1. National Technical University of Athens, Athens, Greece
