Multimodal Fusion: Combining Visual and Textual Cues for Concept Detection in Video
- Damianos Galanopoulos (Centre for Research and Technology Hellas, Information Technologies Institute)
- Milan Dojchinovski (Web Engineering Group, Faculty of Information Technology, Czech Technical University in Prague; Department of Information and Knowledge Engineering, Faculty of Informatics and Statistics, University of Economics)
- Krishna Chandramouli (Division of Enterprise and Cloud Computing, VIT University)
- Tomáš Kliegr (Department of Information and Knowledge Engineering, Faculty of Informatics and Statistics, University of Economics)
- Vasileios Mezaris (Centre for Research and Technology Hellas, Information Technologies Institute)
Visual concept detection is one of the most active research areas in multimedia analysis. The goal of visual concept detection is to assign to each elementary temporal segment of a video a confidence score for each target concept (e.g., forest, ocean, sky). Establishing such associations between video content and concept labels is a key step toward semantics-based indexing, retrieval, and summarization of videos, as well as deeper analysis (e.g., video event detection). Due to its significance for the multimedia analysis community, concept detection is the topic of international benchmarking activities such as TRECVID. While video is typically a multi-modal signal composed of visual content, speech, audio, and possibly also subtitles, most research has so far focused on exploiting the visual modality. In this chapter, we introduce fusion and text analysis techniques for harnessing automatic speech recognition (ASR) transcripts or subtitles to improve the results of visual concept detection. Since the emphasis is on late fusion, the introduced algorithms for handling text and the fusion can be used in conjunction with standard algorithms for visual concept detection. We test our techniques on the TRECVID 2012 Semantic Indexing (SIN) task dataset, which comprises more than 800 hours of heterogeneous videos collected from Internet archives.
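The abstract's core idea, late fusion, means each modality (visual, textual) first produces its own per-concept confidence scores, and these scores are then combined without retraining the underlying detectors. The chapter's specific fusion algorithms are in the full text; the sketch below shows only the generic weighted-linear-combination form of late fusion, with illustrative function names, weights, and concept scores that are not taken from the chapter:

```python
def late_fusion(visual_scores, text_scores, w_visual=0.7):
    """Fuse per-concept confidence scores from two modalities via a
    weighted linear combination (a common generic late-fusion scheme).

    visual_scores, text_scores: dicts mapping concept label -> score in [0, 1].
    w_visual: weight given to the visual modality; the text modality
    receives the complementary weight. The value 0.7 is an arbitrary
    illustrative default, not a figure from the chapter.
    """
    assert 0.0 <= w_visual <= 1.0
    w_text = 1.0 - w_visual
    fused = {}
    for concept, v in visual_scores.items():
        # A concept may never be mentioned in the ASR transcript or
        # subtitles; treat a missing textual score as zero confidence.
        t = text_scores.get(concept, 0.0)
        fused[concept] = w_visual * v + w_text * t
    return fused


# Example: scores for two target concepts from each modality.
fused = late_fusion({"forest": 0.8, "ocean": 0.2},
                    {"forest": 0.6, "ocean": 0.9},
                    w_visual=0.5)
```

Because fusion happens at the score level, the same combination step works regardless of which visual detector or text analysis pipeline produced the inputs, which is why the chapter's techniques can sit on top of standard visual concept detection algorithms.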
Keywords: Video analysis, Visual concept detection, Multimodal fusion, Automatic speech recognition, Text analysis
- Book Title: Multimedia Data Mining and Analytics
- Book Subtitle: Disruptive Innovation
- Book Part: Part IV
- Pages: 295-310
- Publisher: Springer International Publishing
- Copyright Holder: Springer International Publishing Switzerland
- Editor Affiliations: IBM Corp.; Nokia Inc.; Google Inc.; 4i, Inc.
- Author Affiliations:
- Centre for Research and Technology Hellas, Information Technologies Institute, 6th Km. Charilaou - Thermi Road, P.O. Box: 60361, 57001, Thermi-Thessaloniki, Greece
- Web Engineering Group, Faculty of Information Technology, Czech Technical University in Prague, Prague, Czech Republic
- Department of Information and Knowledge Engineering, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic
- Division of Enterprise and Cloud Computing, VIT University, Vellore, India