Weakly supervised detection of video events using hidden conditional random fields

  • Kimiaki Shirahama
  • Marcin Grzegorzek
  • Kuniaki Uehara
Regular Paper


Multimedia Event Detection (MED) is the task of identifying videos in which a certain event occurs. This paper addresses two problems in MED: the weakly supervised setting and unclear event structure. The first arises because associating individual shots with the event is laborious and subject to annotators' subjectivity, so training videos are only loosely annotated as to whether or not they contain the event; it is unknown which shots are relevant or irrelevant to the event. The second problem is the difficulty of assuming the event structure in advance, owing to arbitrary camera and editing techniques. To tackle these problems, we propose a method using a Hidden Conditional Random Field (HCRF), which is a probabilistic discriminative classifier with a set of hidden states. We consider that the weakly supervised setting can be handled by using hidden states as an intermediate layer that discriminates between shots relevant and irrelevant to the event. In addition, the unclear structure of the event can be exposed by the features of each hidden state and its relation to the other states. Based on this idea, we optimise the hidden states and their relations so as to distinguish training videos containing the event from the others. Also, to exploit the full potential of HCRFs, we establish approaches for training video preparation, parameter initialisation and fusion of multiple HCRFs. Experimental results on TRECVID video data validate the effectiveness of our method.
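As a rough illustration of how hidden states can act as an intermediate layer between shots and a video-level label, the sketch below computes the class posterior of a generic linear-chain HCRF by summing over all hidden-state sequences with a forward recursion in log space. It is a minimal sketch under stated assumptions, not the model of this paper: the abstract does not give the parameterisation, so the weight tables `w_obs`, `w_label` and `w_trans`, the function names, and the toy numbers are all hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def class_log_score(shots, label, w_obs, w_label, w_trans):
    """Log of the sum, over all hidden-state sequences, of exp(potential)
    for one video-level label (forward recursion over the shot sequence).

    Assumed (hypothetical) parameterisation:
      w_obs[h]          weight vector scoring a shot feature under state h
      w_label[y][h]     compatibility of state h with label y
      w_trans[y][g][h]  compatibility of the transition g -> h with label y
    """
    if not shots:
        raise ValueError("a video needs at least one shot")
    n_states = len(w_obs)

    def node(t, h):
        # Score of assigning hidden state h to shot t under this label.
        feature_score = sum(w * x for w, x in zip(w_obs[h], shots[t]))
        return feature_score + w_label[label][h]

    alpha = [node(0, h) for h in range(n_states)]
    for t in range(1, len(shots)):
        alpha = [
            node(t, h)
            + logsumexp([alpha[g] + w_trans[label][g][h] for g in range(n_states)])
            for h in range(n_states)
        ]
    return logsumexp(alpha)


def hcrf_posterior(shots, w_obs, w_label, w_trans):
    """P(label | shots) for every label, with hidden states marginalised out."""
    scores = [
        class_log_score(shots, y, w_obs, w_label, w_trans)
        for y in range(len(w_label))
    ]
    z = logsumexp(scores)
    return [math.exp(s - z) for s in scores]


if __name__ == "__main__":
    # Toy video: three shots with 2-dimensional features (made-up values).
    toy_shots = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    toy_w_obs = [[0.5, -0.2], [-0.3, 0.8]]    # 2 hidden states
    toy_w_label = [[0.1, -0.1], [-0.2, 0.4]]  # 2 labels: no event / event
    toy_w_trans = [[[0.0, 0.3], [-0.1, 0.2]], [[0.2, -0.4], [0.5, 0.0]]]
    print(hcrf_posterior(toy_shots, toy_w_obs, toy_w_label, toy_w_trans))
```

Only the video-level label is needed to evaluate this posterior, since no shot-level annotation enters the computation. Training would adjust the weights to raise the posterior of the correct label on the training videos; that step is omitted here.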


Multimedia event detection · Hidden conditional random fields · Weakly supervised setting · Unclear event structure



The research work by Kimiaki Shirahama leading to this article has been funded by the Postdoctoral Fellowship for Research Abroad of the Japan Society for the Promotion of Science (JSPS). This work was also supported in part by JSPS through a Grant-in-Aid for Scientific Research (B): KAKENHI (26280040).



Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • Kimiaki Shirahama (1)
  • Marcin Grzegorzek (1)
  • Kuniaki Uehara (2)
  1. Pattern Recognition Group, University of Siegen, Siegen, Germany
  2. Graduate School of System Informatics, Kobe University, Kobe, Japan
