
Selection of negative samples and two-stage combination of multiple features for action detection in thousands of videos

  • Special Issue Paper
  • Published:
Machine Vision and Applications

Abstract

In this paper, we present a system that detects 48 human actions in realistic videos, ranging from simple actions such as ‘walk’ to complex actions such as ‘exchange’. We propose a method that yields a major improvement in performance. This improvement stems from a different approach to three themes: sample selection, two-stage classification, and the combination of multiple features. First, we show that sampling can be improved by smart selection of the negatives. Second, we show that exploiting the posteriors of all 48 actions in a two-stage classification greatly improves detection. Third, we show how low-level motion features and high-level object features should be combined. Together, these three yield a performance improvement by a factor of 2.37 for human action detection on the visint.org test set of 1,294 realistic videos. In addition, we demonstrate that selective sampling and the two-stage setup improve on standard bag-of-features methods on the UT-Interaction dataset, and that our method outperforms the state of the art on the IXMAS dataset.
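The two-stage setup described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the synthetic feature histograms, the logistic-regression stage classifiers, and the reduced action vocabulary are all placeholders standing in for the paper's features and classifiers. The key idea shown is that the second stage is trained on the full vector of per-action posteriors from the first stage, so correlations between actions can be exploited.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_actions = 5            # stand-in for the paper's 48-action vocabulary
n_train, n_test, n_feat = 200, 50, 20

# Synthetic feature vectors standing in for bag-of-feature histograms
X_train = rng.random((n_train, n_feat))
y_train = rng.integers(0, n_actions, n_train)
X_test = rng.random((n_test, n_feat))

# Stage 1: a classifier over the raw features produces a posterior
# for EVERY action, not just a hard label
stage1 = LogisticRegression(max_iter=1000).fit(X_train, y_train)
post_train = stage1.predict_proba(X_train)   # shape (n_train, n_actions)
post_test = stage1.predict_proba(X_test)

# Stage 2: a classifier trained on the full posterior vectors, so that
# the detector for one action can use evidence about all the others
stage2 = LogisticRegression(max_iter=1000).fit(post_train, y_train)
pred = stage2.predict(post_test)
```

In practice the first-stage posteriors would come from per-action detectors over the motion and object features, and the second stage would be trained on held-out data to avoid overfitting to the first stage's training-set posteriors.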




Acknowledgments

This work is supported by DARPA (Mind’s Eye program). The content of the information does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred. The authors acknowledge the CORTEX scientists for their significant contributions to the overall system: S. P. van den Broek, P. Hanckmann, J.-W. Marck, L. de Penning, J.-M. ten Hove, S. Landsmeer, C. van Leeuwen, A. Halma, M. Kruithof, S. Korzec, W. Ledegang and R. Wijn. Figure 3 has been contributed by E. Boertjes.

Author information

Correspondence to G. J. Burghouts.


About this article

Cite this article

Burghouts, G.J., Schutte, K., Bouma, H. et al. Selection of negative samples and two-stage combination of multiple features for action detection in thousands of videos. Machine Vision and Applications 25, 85–98 (2014). https://doi.org/10.1007/s00138-013-0514-0
