Multimodal Fusion: Combining Visual and Textual Cues for Concept Detection in Video

Abstract

Visual concept detection is one of the most active research areas in multimedia analysis. Its goal is to assign a confidence score for each target concept (e.g., forest, ocean, sky) to each elementary temporal segment of a video. Establishing such associations between video content and concept labels is a key step toward semantics-based indexing, retrieval, and summarization of videos, as well as deeper analysis (e.g., video event detection). Owing to its significance for the multimedia analysis community, concept detection is the topic of international benchmarking activities such as TRECVID. While video is typically a multimodal signal composed of visual content, speech, audio, and possibly subtitles, most research has so far focused on exploiting the visual modality. In this chapter, we introduce fusion and text analysis techniques for harnessing automatic speech recognition (ASR) transcripts or subtitles to improve the results of visual concept detection. Since the emphasis is on late fusion, the introduced text-handling and fusion algorithms can be used in conjunction with standard algorithms for visual concept detection. We test our techniques on the TRECVID 2012 Semantic Indexing (SIN) task dataset, which comprises more than 800 h of heterogeneous videos collected from Internet archives.
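
To make the idea of late fusion concrete, the following minimal sketch combines per-concept confidence scores from a visual detector and a text-based detector for a single video segment. It assumes a simple weighted average, which is only one common late fusion rule and is not necessarily the scheme used in the chapter; the function name `late_fusion`, the `text_weight` parameter, and the example scores are hypothetical.

```python
from typing import Dict


def late_fusion(
    visual: Dict[str, float],
    text: Dict[str, float],
    text_weight: float = 0.3,
) -> Dict[str, float]:
    """Fuse per-concept confidence scores for one video segment.

    Assumption: a weighted average of the two modalities. Concepts that
    have no text-based score keep their visual score unchanged.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")

    fused: Dict[str, float] = {}
    for concept, visual_score in visual.items():
        if concept in text:
            # Both modalities scored this concept: blend them.
            fused[concept] = (
                (1.0 - text_weight) * visual_score
                + text_weight * text[concept]
            )
        else:
            # No textual evidence: fall back to the visual score.
            fused[concept] = visual_score
    return fused


# Hypothetical scores for one shot (not taken from the chapter).
visual_scores = {"forest": 0.8, "ocean": 0.2}
text_scores = {"forest": 0.4}
fused_scores = late_fusion(visual_scores, text_scores, text_weight=0.5)
```

Because the fusion operates only on the output scores, the visual detector can be replaced by any standard concept detection method without changing the text-processing side.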

Notes

  1. Thus the name Explicit Semantic Analysis: because the model uses natural concepts (Wikipedia articles), it is easy to explain to human users.

  2. The ESAlib implementation was obtained from http://ticcky.github.io/esalib/, with the ESA background built from a 2005 Wikipedia snapshot.

Acknowledgments

This work was supported by the European Commission under contract FP7-287911 LinkedTV.

Author information

Corresponding author

Correspondence to Damianos Galanopoulos.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Galanopoulos, D., Dojchinovski, M., Chandramouli, K., Kliegr, T., Mezaris, V. (2015). Multimodal Fusion: Combining Visual and Textual Cues for Concept Detection in Video. In: Baughman, A., Gao, J., Pan, JY., Petrushin, V. (eds) Multimedia Data Mining and Analytics. Springer, Cham. https://doi.org/10.1007/978-3-319-14998-1_13

  • DOI: https://doi.org/10.1007/978-3-319-14998-1_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-14997-4

  • Online ISBN: 978-3-319-14998-1

  • eBook Packages: Computer Science, Computer Science (R0)
