An in-depth evaluation framework for spatio-temporal features

  • Julian Stöttinger
  • Naeem Bhatti
  • Allan Hanbury


The most successful approaches to video understanding and video matching use local spatio-temporal features as a sparse representation of video content. Over the last decade, the evaluation of local visual features in the image domain has attracted great interest, with the aim of guiding researchers in selecting the best approaches for new applications and data-sets. We present FeEval, a framework for the evaluation of spatio-temporal features. For the first time, this framework allows a systematic measurement of the stability and invariance of local features in videos. FeEval consists of 30 original videos from a wide variety of sources, including HDTV shows, 1080p HD movies and surveillance cameras. The videos are iteratively varied by well-defined challenges, leading to a total of 1710 video clips. We measure coverage, repeatability and matching performance under these challenges. Similar to prior work on 2D images, this leads to a new robustness and matching measurement. Supporting the choices of recent state-of-the-art benchmarks, this allows for an in-depth analysis of spatio-temporal features in comparison to recent benchmark results.
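To make the notion of repeatability under a challenge more concrete, the following is a minimal sketch, not the measure defined by FeEval itself. It assumes that each feature is reduced to an (x, y, t) location, that the challenge corresponds to a known mapping from original to transformed coordinates, and that a feature counts as repeated when an unmatched detection lies within a Euclidean tolerance of its mapped position. The function name, the tolerance value and the sample coordinates are all hypothetical.

```python
import math
from typing import Callable, List, Tuple

# Assumption: a feature is summarised by its location (x, y, t) only.
Point = Tuple[float, float, float]


def repeatability(original: List[Point],
                  transformed: List[Point],
                  mapping: Callable[[Point], Point],
                  tol: float = 2.0) -> float:
    """Fraction of original features re-detected in the transformed clip.

    `mapping` projects an original location into the transformed clip.
    Matching is greedy and one-to-one: each transformed detection can
    confirm at most one original feature.
    """
    if not original:
        return 0.0
    unused = list(transformed)
    repeated = 0
    for p in original:
        q = mapping(p)
        # Nearest still-unmatched detection in the transformed clip.
        best = min(unused, key=lambda r: math.dist(q, r), default=None)
        if best is not None and math.dist(q, best) <= tol:
            repeated += 1
            unused.remove(best)
    return repeated / len(original)


# Hypothetical example: a spatial down-scaling challenge by a factor of 0.5.
def half_scale(p: Point) -> Point:
    return (p[0] * 0.5, p[1] * 0.5, p[2])


detected_original = [(10.0, 10.0, 5.0), (40.0, 20.0, 8.0), (60.0, 60.0, 12.0)]
detected_scaled = [(5.0, 5.0, 5.0), (20.5, 10.0, 8.0), (90.0, 90.0, 1.0)]
score = repeatability(detected_original, detected_scaled, half_scale)
```

In this toy example two of the three original features are found again near their mapped positions, so the score is 2/3. A location-only criterion is a simplification chosen for brevity; an evaluation could equally compare the overlap of feature regions in space and time.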


Local feature, Evaluation, Video, Spatio-temporal




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Information Engineering and Computer Science, University of Trento, Trento, Italy
  2. Department of Electronics, Quaid-i-Azam University, Islamabad, Pakistan
  3. Institute of Information Systems Engineering, Vienna University of Technology, Vienna, Austria
