Abstract
In this paper, we present a system that detects 48 human actions in realistic videos, ranging from simple actions such as ‘walk’ to complex actions such as ‘exchange’. We propose a method that substantially improves detection performance. This improvement stems from a different approach to three themes: sample selection, two-stage classification, and the combination of multiple features. First, we show that sampling can be improved by smart selection of the negatives. Second, we show that exploiting the posteriors of all 48 actions in a two-stage classification greatly improves detection. Third, we show how low-level motion features and high-level object features should be combined. Together, these three contributions yield a performance improvement of a factor of 2.37 for human action detection on the visint.org test set of 1,294 realistic videos. In addition, we demonstrate that selective sampling and the two-stage setup improve on standard bag-of-features methods on the UT-Interaction dataset, and that our method outperforms the state-of-the-art on the IXMAS dataset.
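The two-stage classification idea from the abstract can be sketched as follows: a first-stage classifier produces a posterior probability for every action, and a second-stage classifier re-classifies each sample using the full vector of all actions' posteriors as its features, so correlations between actions can be exploited. This is a minimal illustrative sketch, not the paper's exact pipeline; the synthetic data, the number of classes, and the choice of logistic regression for both stages are assumptions made for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_actions = 4          # the paper uses 48 actions; 4 keeps the sketch small
n_per_class, n_dims = 50, 20

# Synthetic stand-in for bag-of-feature vectors: one Gaussian blob per action.
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_class, n_dims))
               for c in range(n_actions)])
y = np.repeat(np.arange(n_actions), n_per_class)

# Stage 1: a multi-class classifier on the raw features yields, for each
# sample, a posterior probability over all actions.
stage1 = LogisticRegression(max_iter=1000).fit(X, y)
posteriors = stage1.predict_proba(X)   # shape: (n_samples, n_actions)

# Stage 2: a second classifier trained on the full posterior vector can
# exploit inter-action correlations that per-action stage-1 scores ignore.
stage2 = LogisticRegression(max_iter=1000).fit(posteriors, y)
acc = stage2.score(posteriors, y)
```

In the paper's setting the stage-2 input would be the 48-dimensional posterior vector per video; here the same mechanism is shown with 4 classes of synthetic data.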
Acknowledgments
This work is supported by DARPA (Mind’s Eye program). The content of the information does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred. The authors acknowledge the CORTEX scientists for their significant contributions to the overall system: S. P. van den Broek, P. Hanckmann, J.-W. Marck, L. de Penning, J.-M. ten Hove, S. Landsmeer, C. van Leeuwen, A. Halma, M. Kruithof, S. Korzec, W. Ledegang and R. Wijn. Figure 3 has been contributed by E. Boertjes.
Burghouts, G.J., Schutte, K., Bouma, H. et al. Selection of negative samples and two-stage combination of multiple features for action detection in thousands of videos. Machine Vision and Applications 25, 85–98 (2014). https://doi.org/10.1007/s00138-013-0514-0