Text Detection in Multimodal Video Analysis

Video Text Detection

Part of the book series: Advances in Computer Vision and Pattern Recognition ((ACVPR))

Abstract

Most video streams use more than one modality to convey hints about the nature of the underlying content. In general, video data are composed of three low-level modalities: the visual modality (i.e., visual objects, motions, and scene changes), the auditory modality, which can be structured foreground or unstructured background sounds in audio sources, and the textual modality, such as natural video text or man-made overlaid dialogue. The concurrent analysis of multiple modalities has thus emerged, especially in recent years, as a potentially more efficient route to automatic video content access. This chapter introduces text detection in multimodal video analysis from a new perspective, organized as follows. We first introduce the relevance of the different modalities present in video, namely the auditory, the visual, and the textual modalities. We then discuss general multimodal data fusion schemes for video analysis and give two examples of connecting video text with other modalities. Next, we give a brief overview of recent multimodal correlation models that integrate the video textual modality. We then discuss multimodal video applications such as text detection and OCR for person identification in broadcast videos, multimodal content-based structure analysis of karaoke, text detection for multimodal movie abstraction and retrieval, and web video classification through the text modality. In summary, text detection in multimodal video analysis remains an open research problem but will become more important in the next decade.

Copyright information

© 2014 Springer-Verlag London

About this chapter

Cite this chapter

Lu, T., Palaiahnakote, S., Tan, C.L., Liu, W. (2014). Text Detection in Multimodal Video Analysis. In: Video Text Detection. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-6515-6_9

  • DOI: https://doi.org/10.1007/978-1-4471-6515-6_9

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-6514-9

  • Online ISBN: 978-1-4471-6515-6

  • eBook Packages: Computer Science (R0)
