Weakly supervised detection of video events using hidden conditional random fields

  • Regular Paper
International Journal of Multimedia Information Retrieval

Abstract

Multimedia Event Detection (MED) is the task of identifying videos in which a certain event occurs. This paper addresses two problems in MED: the weakly supervised setting and the unclear event structure. The first arises because annotating individual shots with respect to the event is laborious and subject to the annotator's subjectivity, so training videos are only loosely annotated as to whether or not they contain the event; it is unknown which shots are relevant or irrelevant to the event. The second is the difficulty of assuming the structure of an event in advance, owing to arbitrary camera work and editing techniques. To tackle these problems, we propose a method using a Hidden Conditional Random Field (HCRF), a probabilistic discriminative classifier with a set of hidden states. We consider that the weakly supervised setting can be handled by using the hidden states as an intermediate layer that discriminates between shots relevant and irrelevant to the event. In addition, the unclear structure of an event can be exposed through the features of each hidden state and its relations to the other states. Based on this idea, we optimise the hidden states and their relations so as to distinguish training videos containing the event from the others. Furthermore, to exploit the full potential of HCRFs, we establish approaches for training video preparation, parameter initialisation and fusion of multiple HCRFs. Experimental results on TRECVID video data validate the effectiveness of our method.
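
The model details appear in the article body rather than in the abstract. For orientation only, the standard HCRF formulation that the abstract builds on treats a video as a shot sequence \(\varvec{x} = (\varvec{x}_{1}, \ldots , \varvec{x}_{T})\) with one hidden state \(h_{t}\) per shot; the potentials written below follow the generic HCRF literature and are not necessarily the exact ones used in the paper (the subscripts \(\mathrm{label}\) and \(\mathrm{trans}\) are illustrative names, while \(\varvec{\theta }_\mathrm{weight}\) matches the notation of Note 1):

\[
P(y \mid \varvec{x}; \varvec{\theta }) = \frac{\sum _{\varvec{h}} \exp \varPsi (y, \varvec{h}, \varvec{x}; \varvec{\theta })}{\sum _{y'} \sum _{\varvec{h}} \exp \varPsi (y', \varvec{h}, \varvec{x}; \varvec{\theta })}, \qquad
\varPsi (y, \varvec{h}, \varvec{x}; \varvec{\theta }) = \sum _{t=1}^{T} \varvec{\theta }_\mathrm{weight}(h_{t}) \cdot \varvec{x}_{t} + \sum _{t=1}^{T} \theta _\mathrm{label}(y, h_{t}) + \sum _{t=2}^{T} \theta _\mathrm{trans}(y, h_{t-1}, h_{t}),
\]

where \(y\) is the video-level label (event present or not), \(\varvec{h} = (h_{1}, \ldots , h_{T})\) ranges over assignments of hidden states to shots, and \(\varvec{x}_{t}\) is the vector of concept detection scores for shot \(t\). Training maximises the conditional likelihood of the video-level labels only, so the hidden states are free to specialise into shot roles (relevant or irrelevant to the event) without any shot-level annotation.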

Notes

  1. It is not reasonable to initialise \(\varvec{\theta }_\mathrm{weight}(h_{i})\) as the centre of the \(i\)th cluster because of the difference in value ranges: while \(\varvec{\theta }_\mathrm{weight}(h_{i})\) takes both positive and negative values, the cluster centre cannot take negative ones, because concept detection scores lie between \(0\) and \(1\) (see the first sketch below these notes).

  2. We also tested PCA to make the dimensions (concepts) independent of each other, and normalisation to give every dimension zero mean and unit variance (see the second sketch below these notes). However, neither of them worked well. We consider that the detection scores for each concept are appropriately biased by the detector, so editing their distribution offers no improvement.
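
A minimal sketch of the range mismatch described in Note 1, assuming the concept detection scores of all shots are stacked into a matrix and clustered with k-means; the data, variable names and use of scikit-learn are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: detection scores for 300 concepts over 500 shots,
# each score lying in [0, 1] as stated in Note 1.
rng = np.random.default_rng(0)
scores = rng.random((500, 300))

# One cluster per intended hidden state h_i.
n_hidden_states = 8
kmeans = KMeans(n_clusters=n_hidden_states, n_init=10, random_state=0).fit(scores)
centres = kmeans.cluster_centers_

# The centres inherit the [0, 1] range of the scores, so they are all non-negative ...
print("centre range:", centres.min(), centres.max())

# ... whereas theta_weight(h_i) must be able to take negative values as well,
# which is why the centre of the i-th cluster is not a sensible direct initialisation.
theta_weight = rng.normal(scale=0.01, size=(n_hidden_states, scores.shape[1]))
print("weight range:", theta_weight.min(), theta_weight.max())
```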
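
And a similarly minimal sketch of the two preprocessing variants that Note 2 reports as unhelpful, again with scikit-learn as an illustrative stand-in for whatever implementation the authors used:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical shot-by-concept score matrix, as in the previous sketch.
rng = np.random.default_rng(0)
scores = rng.random((500, 300))

# Variant 1: PCA, to make the dimensions (concepts) independent of each other.
decorrelated = PCA().fit_transform(scores)

# Variant 2: per-dimension normalisation to zero mean and unit variance.
standardised = StandardScaler().fit_transform(scores)

# Note 2 reports that neither variant improved detection: the per-concept biases
# of the raw detector scores already appear to carry useful information.
```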

Acknowledgments

The research work by Kimiaki Shirahama leading to this article was funded by a Postdoctoral Fellowship for Research Abroad from the Japan Society for the Promotion of Science (JSPS). This work was also supported in part by JSPS through a Grant-in-Aid for Scientific Research (B): KAKENHI (26280040).

Author information

Corresponding author

Correspondence to Kimiaki Shirahama.

Cite this article

Shirahama, K., Grzegorzek, M. & Uehara, K. Weakly supervised detection of video events using hidden conditional random fields. Int J Multimed Info Retr 4, 17–32 (2015). https://doi.org/10.1007/s13735-014-0068-6
