Sequential Max-Margin Event Detectors

  • Dong Huang
  • Shitong Yao
  • Yi Wang
  • Fernando De La Torre
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8691)


Many applications in computer vision (e.g., games, human-computer interaction) require a reliable and early detector of visual events. Existing event detection methods rely on one-versus-all or multi-class classifiers that do not scale well to online detection of a large number of events. This paper proposes Sequential Max-Margin Event Detectors (SMMED) to efficiently detect an event in the presence of a large number of event classes. SMMED sequentially discards classes until only one class is identified as the detected class. This approach has two main benefits over standard approaches: (1) it provides an efficient solution for early detection of events in the presence of a large number of classes, and (2) it is computationally efficient because only a subset of likely classes is evaluated. The benefits of SMMED in comparison with existing approaches are illustrated on three databases using different modalities: MSRDaily Activity (3D depth videos), UCF101 (RGB videos), and the CMU Multi-Modal Action Detection (MAD) database (depth, RGB, and skeleton). The CMU-MAD database was recorded to target the problem of event detection (not classification), and the data and labels are available at .
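
The idea of sequentially discarding classes can be illustrated with a minimal sketch. This is not the SMMED formulation from the paper: the linear per-class scorers, the cumulative-score rule, the `margin` threshold, and the names `sequential_discard`, `weights`, and `frames` are all assumptions made for illustration. It only shows how pruning unlikely classes as frames arrive can both trigger an early decision and reduce the number of classes scored per frame.

```python
from typing import Dict, List, Sequence, Tuple


def sequential_discard(
    frames: Sequence[Sequence[float]],
    weights: Dict[str, Sequence[float]],
    margin: float = 1.0,
) -> Tuple[str, int, List[int]]:
    """Return (detected class, frames consumed, classes scored per frame).

    Hypothetical illustration: each class has a linear scorer, scores are
    accumulated over frames, and a class is dropped once it trails the
    current leader by more than `margin`.
    """
    if not frames or not weights:
        raise ValueError("need at least one frame and one class")
    scores = {c: 0.0 for c in weights}  # cumulative score of surviving classes
    work: List[int] = []                # how many classes were scored at each frame
    for t, x in enumerate(frames, start=1):
        work.append(len(scores))
        for c in scores:
            scores[c] += sum(w * v for w, v in zip(weights[c], x))
        best = max(scores.values())
        # Keep only classes within `margin` of the leader.
        scores = {c: s for c, s in scores.items() if best - s <= margin}
        if len(scores) == 1:
            # A single class survives: report it before the event ends.
            return next(iter(scores)), t, work
    # Frames exhausted: fall back to the highest-scoring survivor.
    return max(scores, key=lambda c: scores[c]), len(frames), work


# Toy data (made up): three classes, 2-D features, four identical frames.
weights = {"wave": [1.0, 0.0], "clap": [0.0, 1.0], "sit": [-1.0, -1.0]}
frames = [[1.0, 0.5]] * 4
label, used, work = sequential_discard(frames, weights, margin=1.0)
print(label, used, work)  # wave 3 [3, 2, 2]
```

In this toy run one class is discarded after the first frame and the decision is made at the third of four frames, so fewer classes are scored per frame than a one-versus-all detector would evaluate.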


Keywords: Event Detection, Activity Recognition, Time Series Analysis, Multi-Modal Action Detection



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Dong Huang (1)
  • Shitong Yao (1)
  • Yi Wang (1)
  • Fernando De La Torre (1)

  1. Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
