
Max-Margin Early Event Detectors


Abstract

The need for early detection of temporal events from sequential data arises in a wide spectrum of applications ranging from human-robot interaction to video security. While temporal event detection has been extensively studied, early detection is a relatively unexplored problem. This paper proposes a maximum-margin framework for training temporal event detectors to recognize partial events, enabling early detection. Our method is based on Structured Output SVM, but extends it to accommodate sequential data. Experiments on datasets of varying complexity, for detecting facial expressions, hand gestures, and human activities, demonstrate the benefits of our approach.
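
To make the idea concrete, below is a minimal, self-contained sketch of the partial-event scoring notion in Python. It is an illustration only, not the authors' implementation: the mean-pooled segment feature, the function names (`segment_feature`, `partial_event_loss`), and the brute-force search over competing segments are all simplifying assumptions made here. The actual method trains a regularized structured SVM with far more efficient constraint generation (the paper's Note 1 points to the CPLEX optimizer).

```python
import numpy as np

def segment_feature(X, s, e):
    """Mean-pool per-frame features over the segment X[s:e] -- a simple,
    hypothetical stand-in for the paper's segment representation."""
    return X[s:e].mean(axis=0)

def partial_event_loss(w, b, X, event, mu=lambda f: f):
    """Illustrative structured hinge loss for one training sequence.

    Core idea of max-margin early event detection: for every time t
    inside the labeled event [s, e], the score of the partial event
    X[s:t] must beat the best competing segment observed so far by a
    margin mu(frac) that grows with the fraction of the event seen.
    This brute-force version is O(T^2) per time step; it is a sketch,
    not the paper's optimization scheme.
    """
    s, e = event
    loss = 0.0
    for t in range(s + 1, e + 1):
        frac = (t - s) / (e - s)                 # fraction of the event seen
        score_partial = w @ segment_feature(X, s, t) + b
        # Best-scoring rival segment ending by time t, including the
        # "no event detected" output, whose score is fixed at 0 here.
        rivals = [0.0] + [w @ segment_feature(X, a, c) + b
                          for a in range(t)
                          for c in range(a + 1, t + 1)
                          if (a, c) != (s, t)]
        loss += max(0.0, mu(frac) - (score_partial - max(rivals)))
    return loss / (e - s)

# Toy usage: 20 frames of 5-D features, event spanning frames 8..14.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
w, b = rng.normal(size=5), 0.0
print(partial_event_loss(w, b, X, event=(8, 14)))
```

In practice one would sum this loss over all training sequences, add an L2 penalty on w, and minimize with a subgradient method or an off-the-shelf QP solver.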


Notes

  1. www-01.ibm.com/software/integration/optimization/cplex-optimizer/.

  2. http://www.robots.ox.ac.uk/~minhhoai/projects/mmed.html.


Acknowledgments

This work was supported by the National Science Foundation (NSF) under Grant No. RI-1116583. The authors would like to thank Yuan Shi for helpful discussions on early detection, Lorenzo Torresani for suggesting the use of F1 curves, Maxim Makatchev for discussions about AMOC curves, Tomas Simon for the AU data, and Patrick Lucey for providing CAPP features for the Extended Cohn–Kanade dataset. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

Author information

Correspondence to Minh Hoai.


Cite this article

Hoai, M., De la Torre, F. Max-Margin Early Event Detectors. Int J Comput Vis 107, 191–202 (2014). https://doi.org/10.1007/s11263-013-0683-3
