Hierarchical Late Fusion for Concept Detection in Videos

  • Sabin Tiberius Strat
  • Alexandre Benoit
  • Patrick Lambert
  • Hervé Bredin
  • Georges Quénot
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)


Current research shows that detecting semantic concepts (e.g., animal, bus, person, dancing) in multimedia documents such as videos requires several types of complementary descriptors to achieve good results. In this work, we explore strategies for efficiently combining dozens of complementary content descriptors (or “experts”) through late fusion, for concept detection in multimedia documents. We explore two fusion approaches that share a common structure: both start with an expert clustering stage, continue with an intra-cluster fusion, and finish with an inter-cluster fusion. We also experiment with other state-of-the-art methods. The first fusion approach relies on a priori knowledge about the internals of each expert to group the available experts by similarity. The second approach automatically derives similarity measures between experts from their outputs, groups the experts using agglomerative clustering, and then combines the results of this fusion with those from other methods. Finally, we show that an additional performance boost can be obtained by also considering the context of multimedia elements.
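The common structure of the two approaches (cluster the experts, fuse within each cluster, then fuse across clusters) can be illustrated with a minimal sketch. It is not the chapter's implementation: the abstract does not specify the similarity measure or the fusion operators, so Pearson correlation between expert outputs, average-linkage agglomerative clustering with a fixed threshold, and plain averaging at both fusion levels are all assumptions made for illustration. The names `pearson`, `cluster_experts`, `hierarchical_late_fusion`, and `threshold` are hypothetical.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two score lists (assumed similarity measure)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def cluster_experts(scores, threshold=0.8):
    """Average-linkage agglomerative clustering of experts by output similarity.

    Merging stops when the most similar pair of clusters falls below
    `threshold` (a hypothetical stopping rule).
    """
    clusters = [[i] for i in range(len(scores))]

    def linkage(c1, c2):
        return mean(pearson(scores[i], scores[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best, a, b = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def average(vectors):
    """Element-wise mean of several equal-length score lists."""
    return [mean(column) for column in zip(*vectors)]


def hierarchical_late_fusion(scores, threshold=0.8):
    """Cluster experts, fuse inside each cluster, then fuse across clusters."""
    clusters = cluster_experts(scores, threshold)
    intra = [average([scores[i] for i in c]) for c in clusters]  # intra-cluster
    return average(intra), clusters                              # inter-cluster


# Toy data: three experts scoring four video shots for one concept.
scores = [
    [0.9, 0.8, 0.2, 0.1],  # expert 0
    [0.8, 0.9, 0.1, 0.2],  # expert 1, strongly correlated with expert 0
    [0.1, 0.9, 0.8, 0.2],  # expert 2, largely uncorrelated with the others
]
fused, clusters = hierarchical_late_fusion(scores)
print(clusters)
print(fused)
```

In this toy case the two correlated experts end up in one cluster, so the dissenting expert carries the same weight as that whole cluster in the final score, whereas a flat average over all three experts would give it only one third of the weight.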


Late fusion, Hierarchical, AdaBoost, Semantic concepts, Video, Semantic indexing



This work was supported by the Quaero Program and the QCompere project, respectively funded by OSEO (French State agency for innovation) and ANR (French national research agency). The authors would also like to thank the members of the IRIM consortium for the expert scores used throughout the experiments described in this paper.



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Sabin Tiberius Strat (1, 2)
  • Alexandre Benoit (1)
  • Patrick Lambert (1)
  • Hervé Bredin (3)
  • Georges Quénot (4)

  1. LISTIC, University of Savoie, Annecy, France
  2. LAPI, University “POLITEHNICA” of Bucharest, Bucharest, Romania
  3. CNRS-LIMSI, Orsay, France
  4. UJF-Grenoble 1 / UPMF-Grenoble 2 / Grenoble INP / CNRS, Grenoble, France
