Abstract
In this paper, we investigate multi-modal approaches to retrieving associated news stories that share the same main topic. In the visual domain, we employ a near-duplicate keyframe/scene detection method that uses local signatures to identify stories with mutual visual cues. Further, to improve the effectiveness of the visual representation, we develop a semantic signature that captures the pre-defined semantic visual concepts present in a news story. We propose a visual concept weighting scheme that combines local and semantic signature similarities into an enhanced visual content similarity. In the textual domain, we utilize Automatic Speech Recognition (ASR) and refined Optical Character Recognition (OCR) transcripts and determine an enhanced textual similarity using the proposed semantic similarity measure. To fuse the textual and visual modalities, we investigate different early and late fusion approaches. In the proposed early fusion approach, we employ two methods to retrieve the visual semantics using textual information. Next, using a late fusion approach, we integrate the uni-modal similarity scores and the early fusion similarity score to boost the final retrieval performance. Experimental results show the usefulness of the enhanced visual content similarity and the early fusion approach, and the superiority of our late fusion approach.
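The late fusion step described above, in which uni-modal similarity scores are integrated with the early fusion score, can be illustrated with a minimal sketch. The abstract does not specify the combination rule, so the min-max normalization and weighted linear sum used here are assumptions chosen for illustration, not the authors' method. The function names (`min_max_normalize`, `late_fusion`, `rank_candidates`), the modality labels, and the example scores and weights are all hypothetical.

```python
from typing import Dict, List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def late_fusion(scores_by_modality: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Fuse per-modality similarity scores with a weighted linear sum.

    ASSUMPTION: a weighted sum of min-max normalized scores is one common
    late fusion rule; the source abstract does not state which rule is used.
    """
    total_weight = sum(weights.values())
    n_candidates = len(next(iter(scores_by_modality.values())))
    fused = [0.0] * n_candidates
    for name, scores in scores_by_modality.items():
        normalized = min_max_normalize(scores)
        w = weights[name] / total_weight  # weights are renormalized to sum to 1
        for i, s in enumerate(normalized):
            fused[i] += w * s
    return fused


def rank_candidates(fused: List[float]) -> List[int]:
    """Return candidate indices ordered from most to least similar."""
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


# Hypothetical similarity scores of three candidate stories to one query story.
example_scores = {
    "visual": [0.2, 0.8, 0.5],
    "textual": [0.1, 0.4, 0.9],
    "early_fusion": [0.3, 0.6, 0.6],
}
# Hypothetical equal weights; in practice these would be tuned on held-out data.
example_weights = {"visual": 1.0, "textual": 1.0, "early_fusion": 1.0}

fused_scores = late_fusion(example_scores, example_weights)
ranking = rank_candidates(fused_scores)
```

Normalizing each modality before summing keeps one score range from dominating the others, which matters when visual and textual similarities are computed on different scales.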
Cite this article
Younessian, E., Rajan, D. Multi-modal fusion for associated news story retrieval. Multimed Tools Appl 74, 2563–2585 (2015). https://doi.org/10.1007/s11042-013-1404-1
DOI: https://doi.org/10.1007/s11042-013-1404-1