Human Focused Action Localization in Video

  • Alexander Kläser
  • Marcin Marszałek
  • Cordelia Schmid
  • Andrew Zisserman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6553)


We propose a novel human-centric approach to detect and localize human actions in challenging video data, such as Hollywood movies. Our goal is to localize actions both temporally through the video and spatially in each frame. We achieve this by first obtaining generic spatio-temporal human tracks and then detecting specific actions within these tracks using a sliding-window classifier.
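The two-stage pipeline described above (generic tracks first, then a temporal sliding window over each track) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, window lengths, stride, threshold, and score aggregation are all assumptions made for the sketch.

```python
import numpy as np

def sliding_window_detections(track_scores, window_lengths=(20, 40, 80),
                              stride=5, threshold=0.0):
    """Slide temporal windows of several lengths along one human track.

    track_scores: per-frame action-classifier scores along the track.
    Returns a list of (start, end, score) candidate detections; the
    spatial extent is already fixed by the track itself.
    """
    scores = np.asarray(track_scores, dtype=float)
    candidates = []
    for length in window_lengths:
        for start in range(0, max(1, len(scores) - length + 1), stride):
            window = scores[start:start + length]
            s = window.mean()  # illustrative aggregation of frame scores
            if s > threshold:
                candidates.append((start, start + len(window), s))
    return candidates

def nms_1d(candidates, max_overlap=0.5):
    """Greedy temporal non-maximum suppression on (start, end, score)."""
    kept = []
    for c in sorted(candidates, key=lambda c: -c[2]):
        suppressed = False
        for k in kept:
            inter = max(0, min(c[1], k[1]) - max(c[0], k[0]))
            union = (c[1] - c[0]) + (k[1] - k[0]) - inter
            if union > 0 and inter / union > max_overlap:
                suppressed = True
                break
        if not suppressed:
            kept.append(c)
    return kept
```

Because the tracks are generic (not action-specific), the expensive tracking stage runs once per video and only the cheap windowed scoring is repeated per action class.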

We make the following contributions: (i) We show that splitting the action localization task into spatial and temporal search leads to an efficient localization algorithm where generic human tracks can be reused to recognize multiple human actions; (ii) We develop a human detector and tracker which is able to cope with a wide range of postures, articulations, motions and camera viewpoints. The tracker includes detection interpolation and a principled classification stage to suppress false positive tracks; (iii) We propose a track-aligned 3D-HOG action representation, investigate its parameters, and show that action localization benefits from using tracks; and (iv) We introduce a new action localization dataset based on Hollywood movies.
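Contribution (iii), the track-aligned 3D-HOG representation, can be sketched roughly as below. This is a deliberately simplified stand-in: it bins per-frame 2D gradient orientations into spatio-temporal cells, whereas the paper's 3D-HOG quantizes full spatio-temporal gradients; the cell grid and bin count here are illustrative assumptions.

```python
import numpy as np

def track_aligned_hog_sketch(volume, cells=(2, 2, 2), n_bins=8):
    """Simplified track-aligned gradient histogram descriptor.

    volume: (T, H, W) array of grayscale frames already cropped and
            rescaled to follow one human track, so the person stays
            roughly centered (this is the "track alignment").
    """
    T, H, W = volume.shape
    gy, gx = np.gradient(volume.astype(float), axis=(1, 2))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi  # unsigned 2D orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)

    ct, cy, cx = cells
    desc = np.zeros((ct, cy, cx, n_bins))
    ts = np.linspace(0, T, ct + 1).astype(int)
    ys = np.linspace(0, H, cy + 1).astype(int)
    xs = np.linspace(0, W, cx + 1).astype(int)
    for i in range(ct):
        for j in range(cy):
            for k in range(cx):
                b = bins[ts[i]:ts[i+1], ys[j]:ys[j+1], xs[k]:xs[k+1]]
                m = mag[ts[i]:ts[i+1], ys[j]:ys[j+1], xs[k]:xs[k+1]]
                # accumulate gradient magnitude into orientation bins
                desc[i, j, k] = np.bincount(b.ravel(), weights=m.ravel(),
                                            minlength=n_bins)
    desc = desc.ravel()
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```

The key idea the sketch preserves is that the descriptor grid moves with the track, so the histogram cells stay registered to the person rather than to fixed image coordinates.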

Results are presented on a number of real-world movies with crowded, dynamic environments, partial occlusions and cluttered backgrounds. On the Coffee&Cigarettes dataset we significantly improve over the state of the art. Furthermore, we obtain excellent results on the new Hollywood–Localization dataset.


Keywords: Action recognition, localization, human tracking, HOG


Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Alexander Kläser (1)
  • Marcin Marszałek (2)
  • Cordelia Schmid (1)
  • Andrew Zisserman (2)
  1. INRIA Grenoble, LEAR, LJK, France
  2. Engineering Science, University of Oxford, UK
