Recognizing Human Actions by Using Effective Codebooks and Tracking

  • Lamberto Ballan
  • Lorenzo Seidenari
  • Giuseppe Serra
  • Marco Bertini
  • Alberto Del Bimbo
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)


Recognizing and classifying human actions for the annotation of unconstrained video sequences has proven challenging because of variations in the environment, the appearance of the actors, the ways in which different people perform the same action, the speed and duration of the action, and the points of view from which the event is observed. This variability makes it difficult to define effective descriptors and to derive appropriate and effective codebooks for action categorization. In this chapter, we present a novel and effective solution for classifying human actions in unconstrained videos. To form the codebook, we employ radius-based clustering with soft assignment in order to create a rich vocabulary that can account for the high variability of human actions. We show that our solution achieves very good performance without the need for parameter tuning. We also show that computation time can be strongly reduced, with little loss of accuracy, by reducing the codebook size with Deep Belief Networks.
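As a rough illustration of the two ideas named in the abstract, the sketch below builds a codebook with a simplified, greedy fixed-radius clustering and then computes a soft-assignment histogram using Gaussian weights. It is only a sketch under stated assumptions: the abstract does not specify the clustering procedure or the weighting kernel, so the greedy strategy, the Gaussian kernel, and the names `radius_codebook`, `soft_histogram`, `radius`, `max_words` and `sigma` are all hypothetical choices made here for illustration, not the chapter's implementation.

```python
import numpy as np


def radius_codebook(X, radius, max_words):
    """Greedy fixed-radius clustering (a simplification assumed for this sketch)."""
    remaining = np.asarray(X, dtype=float)
    centers = []
    while len(remaining) > 0 and len(centers) < max_words:
        # Pairwise squared distances among the descriptors not yet assigned.
        d2 = ((remaining[:, None, :] - remaining[None, :, :]) ** 2).sum(axis=-1)
        inside = d2 <= radius ** 2
        # Pick the descriptor with the most neighbours within the radius.
        best = inside.sum(axis=1).argmax()
        members = inside[best]
        # The new codeword is the mean of that neighbourhood.
        centers.append(remaining[members].mean(axis=0))
        # Remove the covered descriptors so sparser regions also get codewords.
        remaining = remaining[~members]
    return np.vstack(centers)


def soft_histogram(X, codebook, sigma):
    """Soft assignment: each descriptor votes for every codeword with a Gaussian weight."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # Subtracting the per-row minimum avoids underflow; it cancels after normalisation.
    w = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)
    hist = w.sum(axis=0)
    return hist / hist.sum()


if __name__ == "__main__":
    # Toy 2-D "descriptors" forming two well-separated groups.
    X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
    codebook = radius_codebook(X, radius=1.0, max_words=10)
    print(codebook)
    print(soft_histogram(X, codebook, sigma=1.0))
```

Because the greedy step removes every descriptor within the radius of a chosen centre, sparse regions of the feature space still receive codewords. This is the usual motivation for radius-based clustering over k-means, which tends to concentrate centres in dense regions.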


Keywords: Video Sequence, Optic Flow, Action Recognition, Interest Point, Late Fusion



Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Lamberto Ballan (1)
  • Lorenzo Seidenari (1)
  • Giuseppe Serra (1)
  • Marco Bertini (1)
  • Alberto Del Bimbo (1)

  1. Media Integration and Communication Center, University of Florence, Florence, Italy
