Abstract
In this paper, we present a system that detects 48 human actions in realistic videos, ranging from simple actions such as ‘walk’ to complex actions such as ‘exchange’. We propose a method that substantially improves detection performance. This improvement stems from a different approach to three themes: sample selection, two-stage classification, and the combination of multiple features. First, we show that sampling can be improved by smart selection of the negatives. Second, we show that exploiting the posteriors of all 48 actions in a two-stage classification greatly improves detection. Third, we show how low-level motion features and high-level object features should be combined. Together, these three contributions yield a performance improvement of a factor of 2.37 for human action detection on the visint.org test set of 1,294 realistic videos. In addition, we demonstrate that selective sampling and the two-stage setup improve on standard bag-of-features methods on the UT-Interaction dataset, and that our method outperforms the state-of-the-art on the IXMAS dataset.
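The two-stage classification idea from the abstract can be sketched as follows: a first-stage classifier produces a posterior probability for every action, and a second-stage classifier re-classifies each sample using the full vector of all actions' posteriors as its features, so correlations between actions can be exploited. This is a minimal illustrative sketch, not the paper's exact pipeline; the synthetic data, the number of classes, and the choice of logistic regression for both stages are assumptions made for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_actions = 4          # the paper uses 48 actions; 4 keeps the sketch small
n_per_class, n_dims = 50, 20

# Synthetic stand-in for bag-of-feature vectors: one Gaussian blob per action.
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_class, n_dims))
               for c in range(n_actions)])
y = np.repeat(np.arange(n_actions), n_per_class)

# Stage 1: a multi-class classifier on the raw features yields, for each
# sample, a posterior probability over all actions.
stage1 = LogisticRegression(max_iter=1000).fit(X, y)
posteriors = stage1.predict_proba(X)   # shape: (n_samples, n_actions)

# Stage 2: a second classifier trained on the full posterior vector can
# exploit inter-action correlations that per-action stage-1 scores ignore.
stage2 = LogisticRegression(max_iter=1000).fit(posteriors, y)
acc = stage2.score(posteriors, y)
```

In the paper's setting the stage-2 input would be the 48-dimensional posterior vector per video; here the same mechanism is shown with 4 classes of synthetic data.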
Acknowledgments
This work is supported by DARPA (Mind’s Eye program). The content of the information does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred. The authors acknowledge the CORTEX scientists for their significant contributions to the overall system: S. P. van den Broek, P. Hanckmann, J.-W. Marck, L. de Penning, J.-M. ten Hove, S. Landsmeer, C. van Leeuwen, A. Halma, M. Kruithof, S. Korzec, W. Ledegang and R. Wijn. Figure 3 has been contributed by E. Boertjes.
Burghouts, G.J., Schutte, K., Bouma, H. et al. Selection of negative samples and two-stage combination of multiple features for action detection in thousands of videos. Machine Vision and Applications 25, 85–98 (2014). https://doi.org/10.1007/s00138-013-0514-0