Blind late fusion in multimedia event retrieval

  • Maaike H. T. de Boer
  • Klamer Schutte
  • Hao Zhang
  • Yi-Jie Lu
  • Chong-Wah Ngo
  • Wessel Kraaij
Regular Paper


One of the challenges in multimedia event retrieval is the integration of data from multiple modalities. A modality is defined as a single channel of sensory input, such as visual or audio; we also refer to a modality as a data source. Previous research has shown that integrating different data sources can improve performance compared to using only one source, but clear insight into the success factors of alternative fusion methods is still lacking. We introduce several new blind late fusion methods based on inversions and ratios of the state-of-the-art blind fusion methods, and we compare their performance both in simulations and on TRECVID MED, an international benchmark data set for multimedia event retrieval. The results show that five of the proposed methods outperform the state-of-the-art methods when sufficient training examples are available (100 examples). The novel fusion method named JRER is not only the best method with dependent data sources but is also robust in all simulations with sufficient training examples.
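To make the term concrete, the following is a minimal sketch of blind late fusion with common fixed combination rules (average, product, maximum, minimum). It is an illustration only and does not implement JRER or any other method proposed in the paper, whose definitions are not given in this abstract. The function names `fuse_scores` and `rank_videos` and the example scores are hypothetical; the sketch assumes that each data source already produces a confidence score in [0, 1] per video.

```python
from math import prod
from statistics import fmean

# Fixed combination rules; none of them needs training data,
# which is what makes the fusion "blind". (Illustrative set, not the paper's.)
RULES = {
    "average": fmean,
    "product": prod,
    "max": max,
    "min": min,
}


def fuse_scores(scores_per_source, rule="average"):
    """Combine per-source scores into one fused score per video.

    scores_per_source[s][v] is the (assumed) confidence in [0, 1] that
    data source s assigns to video v. All sources must score the same videos.
    """
    if rule not in RULES:
        raise ValueError(f"unknown rule: {rule!r}")
    combine = RULES[rule]
    # zip(*...) regroups the scores so that each tuple belongs to one video.
    return [combine(video_scores) for video_scores in zip(*scores_per_source)]


def rank_videos(fused):
    """Return video indices ordered from highest to lowest fused score."""
    return sorted(range(len(fused)), key=lambda v: fused[v], reverse=True)


# Hypothetical scores for three videos from two data sources.
visual = [0.9, 0.2, 0.6]
audio = [0.5, 0.4, 0.9]

fused_average = fuse_scores([visual, audio], rule="average")
fused_product = fuse_scores([visual, audio], rule="product")
ranking = rank_videos(fused_average)
```

Because fusion happens on the output scores rather than on the features, each data source can use its own classifier, and a rule can be swapped without retraining any of them.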


Keywords: Multimedia event retrieval, Multimodal integration, Late fusion



We would like to thank the TNO Early Research Program Making Sense of Big Data (MSoBD) for financial support. The work described in this paper was supported in part by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 120213).



Copyright information

© Springer-Verlag London 2016

Authors and Affiliations

  • Maaike H. T. de Boer (1, 2), corresponding author
  • Klamer Schutte (1)
  • Hao Zhang (3)
  • Yi-Jie Lu (3)
  • Chong-Wah Ngo (3)
  • Wessel Kraaij (4, 5)

  1. TNO, Oude Waalsdorperweg, AK The Hague, The Netherlands
  2. Radboud University, EC Nijmegen, The Netherlands
  3. City University, Kowloon Tong, Hong Kong
  4. TNO, The Hague, The Netherlands
  5. Leiden University, Leiden, The Netherlands
