Video classification with Densely extracted HOG/HOF/MBH features: an evaluation of the accuracy/computational efficiency trade-off

  • J. Uijlings
  • I. C. Duta
  • E. Sangineto
  • Nicu Sebe
Regular Paper


The current state of the art in video classification is based on Bag-of-Words using local visual descriptors. Most commonly these are histogram of oriented gradients (HOG), histogram of optical flow (HOF) and motion boundary histogram (MBH) descriptors. While such an approach is very powerful for classification, it is also computationally expensive. This paper addresses the problem of computational efficiency. Specifically: (1) we propose several speed-ups for densely sampled HOG, HOF and MBH descriptors and release Matlab code; (2) we investigate the trade-off between accuracy and computational efficiency of descriptors in terms of frame sampling rate and type of optical flow method; (3) we investigate the trade-off between accuracy and computational efficiency for computing the feature vocabulary, comparing most of the commonly adopted vector quantization techniques: \(k\)-means, hierarchical \(k\)-means, Random Forests, Fisher Vectors and VLAD.
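As a rough illustration of two of the encodings compared in point (3), the sketch below contrasts a hard-assignment Bag-of-Words histogram with a VLAD encoding over a fixed vocabulary. It is not the released Matlab code: the function names, the toy two-dimensional descriptors and the two-word vocabulary are assumptions made for illustration, and a real pipeline would learn the vocabulary with \(k\)-means from HOG/HOF/MBH descriptors.

```python
from math import sqrt

def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary centre closest to the descriptor."""
    def sq_dist(centre):
        return sum((d - c) ** 2 for d, c in zip(descriptor, centre))
    return min(range(len(vocabulary)), key=lambda k: sq_dist(vocabulary[k]))

def bow_histogram(descriptors, vocabulary):
    """Hard-assignment Bag-of-Words: count descriptors per word, L1-normalised."""
    counts = [0.0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def vlad_encode(descriptors, vocabulary):
    """VLAD: sum residuals (descriptor - centre) per word, concatenate, L2-normalise."""
    dim = len(vocabulary[0])
    residuals = [[0.0] * dim for _ in vocabulary]
    for d in descriptors:
        k = nearest_word(d, vocabulary)
        for j in range(dim):
            residuals[k][j] += d[j] - vocabulary[k][j]
    flat = [v for row in residuals for v in row]
    norm = sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm else flat

# Toy data (hypothetical): a two-word vocabulary and four 2-D descriptors.
vocabulary = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [1.0, 1.0]]

hist = bow_histogram(descriptors, vocabulary)  # one bin per visual word
vlad = vlad_encode(descriptors, vocabulary)    # one residual vector per visual word
```

The histogram has one value per visual word, whereas the VLAD vector has one value per word and descriptor dimension, which is one reason such encodings are typically used with much smaller vocabularies.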


Video classification, HOG, HOF, MBH, Computational efficiency



This work was supported by the European 7th Framework Program, under grant xLiMe (FP7-611346) and by the FIRB project S-PATTERNS.



Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • J. Uijlings (1)
  • I. C. Duta (2)
  • E. Sangineto (2)
  • Nicu Sebe (2)
  1. University of Edinburgh, Edinburgh, UK
  2. DISI, University of Trento, Trento, Italy
