Multimodal Fusion: Combining Visual and Textual Cues for Concept Detection in Video

  • Damianos Galanopoulos
  • Milan Dojchinovski
  • Krishna Chandramouli
  • Tomáš Kliegr
  • Vasileios Mezaris


Visual concept detection is one of the most active research areas in multimedia analysis. Its goal is to assign to each elementary temporal segment of a video a confidence score for each target concept (e.g., forest, ocean, sky). Establishing such associations between video content and concept labels is a key step toward semantics-based indexing, retrieval, and summarization of videos, as well as deeper analysis (e.g., video event detection). Because of its significance for the multimedia analysis community, concept detection is the topic of international benchmarking activities such as TRECVID. Although video is typically a multimodal signal composed of visual content, speech, audio, and possibly subtitles, most research has so far focused on exploiting the visual modality. In this chapter, we introduce fusion and text analysis techniques that harness automatic speech recognition (ASR) transcripts or subtitles to improve the results of visual concept detection. Since the emphasis is on late fusion, the introduced text-handling and fusion algorithms can be used in conjunction with standard algorithms for visual concept detection. We test our techniques on the TRECVID 2012 Semantic Indexing (SIN) task dataset, which comprises more than 800 hours of heterogeneous videos collected from Internet archives.
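
To make the idea of late fusion concrete, the following is a minimal sketch, not the chapter's actual method: it assumes that a visual detector and a text-based detector each produce a confidence score in [0, 1] per shot and concept, and combines them with a fixed linear weight. The function name `late_fusion`, the `text_weight` parameter, and the nested-dictionary data layout are all hypothetical choices made for illustration.

```python
from typing import Dict

# Hypothetical layout: scores[shot_id][concept] -> confidence in [0, 1].
Scores = Dict[str, Dict[str, float]]


def late_fusion(visual: Scores, textual: Scores, text_weight: float = 0.3) -> Scores:
    """Linearly combine visual and textual confidence scores per shot and concept.

    Where no textual score exists for a shot/concept pair (e.g., a shot with
    no speech), the visual score is kept unchanged.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    fused: Scores = {}
    for shot_id, concept_scores in visual.items():
        text_scores = textual.get(shot_id, {})
        fused[shot_id] = {}
        for concept, v in concept_scores.items():
            t = text_scores.get(concept)
            if t is None:
                fused[shot_id][concept] = v  # fall back to the visual score
            else:
                fused[shot_id][concept] = (1 - text_weight) * v + text_weight * t
    return fused


# Toy example with made-up scores for two shots and two concepts.
visual = {
    "shot_1": {"forest": 0.8, "ocean": 0.1},
    "shot_2": {"forest": 0.2, "ocean": 0.6},
}
textual = {"shot_1": {"forest": 0.4}}  # only one textual score is available
fused = late_fusion(visual, textual, text_weight=0.5)
```

Because the combination operates only on the output scores, either detector can be replaced without retraining the other, which is the property the abstract attributes to late fusion. The fixed weight is a simplifying assumption; it could instead be tuned per concept on a validation set.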


Keywords: Video analysis, Visual concept detection, Multimodal fusion, Automatic speech recognition, Text analysis



This work was supported by the European Commission under contract FP7-287911 LinkedTV.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Damianos Galanopoulos (1, corresponding author)
  • Milan Dojchinovski (2, 3)
  • Krishna Chandramouli (3)
  • Tomáš Kliegr (4)
  • Vasileios Mezaris (1)

  1. Centre for Research and Technology Hellas, Information Technologies Institute, Thermi-Thessaloniki, Greece
  2. Web Engineering Group, Faculty of Information Technology, Czech Technical University in Prague, Prague, Czech Republic
  3. Department of Information and Knowledge Engineering, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic
  4. Division of Enterprise and Cloud Computing, VIT University, Vellore, India
