Multi-modal fusion for associated news story retrieval

Younessian, Ehsan; Rajan, Deepu

doi:10.1007/s11042-013-1404-1


Published: 08 March 2013

Volume 74, pages 2563–2585, (2015)

Multimedia Tools and Applications


Abstract

In this paper, we investigate multi-modal approaches to retrieving associated news stories that share the same main topic. In the visual domain, we employ a near-duplicate keyframe/scene detection method using local signatures to identify stories with mutual visual cues. Further, to improve the effectiveness of the visual representation, we develop a semantic signature that contains the pre-defined semantic visual concepts in a news story. We propose a visual concept weighting scheme to combine local and semantic signature similarities into an enhanced visual content similarity. In the textual domain, we utilize Automatic Speech Recognition (ASR) and refined Optical Character Recognition (OCR) transcripts and determine the enhanced textual similarity using the proposed semantic similarity measure. To fuse the textual and visual modalities, we investigate different early and late fusion approaches. In the proposed early fusion approach, we employ two methods to retrieve the visual semantics using textual information. Next, using a late fusion approach, we integrate the uni-modal similarity scores and the early fusion similarity score to boost the final retrieval performance. Experimental results show the usefulness of the enhanced visual content similarity and the early fusion approach, and the superiority of our late fusion approach.
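
The late fusion step described in the abstract, integrating the uni-modal similarity scores with the early fusion similarity score, can be illustrated with a minimal sketch. The abstract does not specify the combination rule, so the min-max normalization and weighted average used here are assumptions chosen for illustration, not the authors' method. The names `min_max_normalize`, `late_fusion`, and the modality keys are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one modality's scores to [0, 1] so modalities are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates scored the same: this modality carries no ranking signal.
        return {story_id: 0.0 for story_id in scores}
    return {story_id: (s - lo) / (hi - lo) for story_id, s in scores.items()}


def late_fusion(
    score_lists: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank candidate stories by a weighted average of normalized scores.

    score_lists maps a modality name (e.g. "visual", "textual", "early")
    to {story_id: similarity to the query story}. The weighted average is
    an assumed fusion rule, not one taken from the paper.
    """
    if not score_lists:
        return []
    normalized = {name: min_max_normalize(s) for name, s in score_lists.items()}
    total_weight = sum(weights[name] for name in normalized)
    # Only fuse stories that have a score in every modality.
    story_ids = set.intersection(*(set(s) for s in normalized.values()))
    fused = {
        story_id: sum(weights[name] * normalized[name][story_id] for name in normalized)
        / total_weight
        for story_id in story_ids
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

In this sketch each modality is normalized independently before weighting, so a modality with a larger raw score range does not dominate the fused ranking. The weights would have to be tuned or learned on held-out data.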



Author information

Authors and Affiliations

Center for Multimedia and Network Technology, School of Computer Engineering, Nanyang Technological University, Nanyang, 639798, Singapore
Ehsan Younessian & Deepu Rajan


Corresponding author

Correspondence to Ehsan Younessian.


About this article

Cite this article

Younessian, E., Rajan, D. Multi-modal fusion for associated news story retrieval. Multimed Tools Appl 74, 2563–2585 (2015). https://doi.org/10.1007/s11042-013-1404-1


Published: 08 March 2013
Issue Date: April 2015
DOI: https://doi.org/10.1007/s11042-013-1404-1
