Machine Vision and Applications, Volume 25, Issue 1, pp 85–98

Selection of negative samples and two-stage combination of multiple features for action detection in thousands of videos

  • G. J. Burghouts
  • K. Schutte
  • H. Bouma
  • R. J. M. den Hollander
Special Issue Paper


In this paper, a system is presented that can detect 48 human actions in realistic videos, ranging from simple actions such as ‘walk’ to complex actions such as ‘exchange’. We propose a method that yields a major improvement in performance. The reason for this improvement is a different approach to three themes: sample selection, two-stage classification, and the combination of multiple features. First, we show that sampling can be improved by smart selection of the negatives. Second, we show that exploiting the posteriors of all 48 actions in a two-stage classification greatly improves detection. Third, we show how low-level motion features and high-level object features should be combined. Together, these three yield a performance improvement by a factor of 2.37 for human action detection on a test set of 1,294 realistic videos. In addition, we demonstrate that selective sampling and the two-stage setup improve on standard bag-of-features methods on the UT-Interaction dataset, and that our method outperforms the state of the art on the IXMAS dataset.
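The two-stage classification described above (first a posterior per action, then re-classification from the full vector of all actions' posteriors) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy data, the nearest-centroid stage-1 detectors, and the nearest-centroid stage-2 rule in posterior space are assumptions that stand in for the paper's random-forest and SVM classifiers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the paper's setting: N video clips, each summarised by a
# feature vector and labelled with one of K actions (the paper uses K = 48;
# K = 3 keeps this sketch small).
K, D, N = 3, 8, 90
centers = rng.normal(0.0, 3.0, size=(K, D))   # well-separated class centres
y = rng.integers(0, K, size=N)                # ground-truth action labels
X = centers[y] + rng.normal(0.0, 1.0, size=(N, D))

# Stage 1: one detector per action, each emitting a posterior for its action.
# A nearest-centroid score squashed through a softmax stands in for the
# paper's detectors (an assumption, for illustration only).
def stage1_posteriors(X, centers):
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    logits = -sq_dist
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # rows sum to 1

P = stage1_posteriors(X, centers)             # (N, K) action posteriors

# Stage 2: re-classify each clip from the full vector of all K posteriors,
# so that correlations between actions can be exploited. A nearest-centroid
# rule in posterior space keeps the sketch self-contained.
class_means = np.stack([P[y == k].mean(axis=0) for k in range(K)])
pred = ((P[:, None, :] - class_means[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
accuracy = float((pred == y).mean())
```

The key design point is that stage 2 sees the posteriors of every action, not just the one being detected, which is how correlations between the 48 actions are brought to bear on each individual detection.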


Keywords: Human action detection · Sparse representation · Pose estimation · Interactions between people · Spatiotemporal features · STIP · Tracking of humans · Person detection · Event recognition · Random forest · Support vector machines



This work is supported by DARPA (Mind's Eye program). The content of the information does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred. The authors acknowledge the CORTEX scientists for their significant contributions to the overall system: S. P. van den Broek, P. Hanckmann, J.-W. Marck, L. de Penning, J.-M. ten Hove, S. Landsmeer, C. van Leeuwen, A. Halma, M. Kruithof, S. Korzec, W. Ledegang and R. Wijn. Figure 3 was contributed by E. Boertjes.



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • G. J. Burghouts¹
  • K. Schutte¹
  • H. Bouma¹
  • R. J. M. den Hollander¹

  1. Intelligent Imaging, TNO, The Hague, The Netherlands
