
Blind late fusion in multimedia event retrieval


One of the challenges in multimedia event retrieval is the integration of data from multiple modalities. A modality is a single channel of sensory input, such as visual or audio; we also refer to it as a data source. Previous research has shown that integrating different data sources can improve performance compared with using a single source, but clear insight into the success factors of alternative fusion methods is still lacking. We introduce several new blind late fusion methods based on inversions and ratios of the state-of-the-art blind fusion methods, and compare their performance both in simulations and on TRECVID MED, an international benchmark data set for multimedia event retrieval. The results show that five of the proposed methods outperform the state-of-the-art methods when sufficient training examples are available (100 examples). The novel fusion method named JRER is not only the best method with dependent data sources, but is also robust in all simulations with sufficient training examples.
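As a rough illustration of what blind late fusion looks like in practice, the sketch below combines per-source scores with three classic rules (average, product, maximum) and ranks videos by the fused score. "Blind" is taken here to mean that no fusion weights are learned from training data. This is not JRER or any other method proposed in the article, whose definitions are not given in the abstract. The function names, the example scores, and the assumption that every source already outputs scores normalized to [0, 1] are all hypothetical.

```python
from math import prod
from typing import Callable, List, Sequence

# One inner sequence per video: its scores from each data source
# (for example visual and motion). Assumption: scores are in [0, 1].
ScoreTable = Sequence[Sequence[float]]
FusionRule = Callable[[Sequence[float]], float]


def fuse_average(scores: Sequence[float]) -> float:
    """Arithmetic mean of the per-source scores."""
    return sum(scores) / len(scores)


def fuse_product(scores: Sequence[float]) -> float:
    """Product of the per-source scores; one low score pulls the result down."""
    return prod(scores)


def fuse_max(scores: Sequence[float]) -> float:
    """Highest per-source score; one confident source is enough."""
    return max(scores)


def rank_videos(table: ScoreTable, rule: FusionRule) -> List[int]:
    """Return video indices ordered from highest to lowest fused score."""
    fused = [rule(scores) for scores in table]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


# Hypothetical scores for three videos from two sources.
example_scores = [
    [0.90, 0.20],  # video 0: sources disagree
    [0.60, 0.60],  # video 1: sources agree moderately
    [0.10, 0.95],  # video 2: sources disagree the other way
]

for name, rule in [("average", fuse_average),
                   ("product", fuse_product),
                   ("max", fuse_max)]:
    print(name, rank_videos(example_scores, rule))
```

With these made-up scores, the average and product rules favor the video on which both sources agree, whereas the maximum rule favors whichever video has the single highest score. Differences of this kind are what a comparison of blind fusion rules has to account for.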





We would like to thank the TNO Early Research Program Making Sense of Big Data (MSoBD) for financial support. The work described in this paper was supported in part by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 120213).

Author information


Corresponding author

Correspondence to Maaike H. T. de Boer.



Appendix: see Tables 4, 5, 6, and 7.

Table 4 Performance of the late fusion methods for different simulated distributions on a 100Ex case
Table 5 Performance of the late fusion methods for simulated distributions on a 10Ex case
Table 6 %MAP integrating visual and motion features in MED14Test 100Ex
Table 7 %MAP integrating visual and motion features in MED14Test 10Ex


Cite this article

de Boer, M.H.T., Schutte, K., Zhang, H. et al. Blind late fusion in multimedia event retrieval. Int J Multimed Info Retr 5, 203–217 (2016).


  • Multimedia event retrieval
  • Multimodal
  • Integration
  • Late fusion