
Robustly detect different types of text in videos

Original Article · Published in Neural Computing and Applications

Abstract

Text in videos can be categorized into three types: overlaid text, layered text, and scene text. Existing detection methods focus on a specific type of text and cannot achieve good performance on the other types. To our knowledge, few works have explored building a system that detects all types of text simultaneously. In this paper, we present a unified video text detector that can accurately localize all types of text in videos at the same time. Our system consists of a spatial text detector and a temporal fusion filter. First, we explore three different strategies for learning the spatial text detector based on deep convolutional neural networks, so that it can detect various texts simultaneously without knowing their types. Then, a new area-first non-maximum suppression scheme combined with multiple constraints is proposed to remove redundant bounding boxes. Finally, the temporal fusion filter exploits spatial-location and text-component features to integrate the detection results of consecutive frames and further remove false positives. To validate the proposed approach, comprehensive experiments are carried out on three publicly available datasets covering overlaid text, layered text, and scene text. The experimental results demonstrate that our method consistently achieves the best performance compared with state-of-the-art methods.
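To make the post-processing steps of the abstract more concrete, the sketch below illustrates two of the ideas in Python: an area-first non-maximum suppression pass and a temporal persistence check over consecutive frames. This is a minimal illustration only, assuming that "area-first" means ranking candidate boxes by area rather than by score; the function names (area_first_nms, persistent_detections), the thresholds, and the reduction of the temporal fusion filter to a pure spatial-overlap test are our own simplifications and do not reproduce the authors' implementation or their text-component features.

```python
import numpy as np


def area_first_nms(boxes, scores, iou_thresh=0.5, min_score=0.5):
    """Area-first NMS sketch: rank candidates by area (largest first) so that
    large text regions absorb smaller, highly overlapping proposals.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)

    # One simple confidence constraint: drop weak candidates up front.
    keep = scores >= min_score
    boxes, scores = boxes[keep], scores[keep]

    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = areas.argsort()[::-1]  # largest area first, not highest score

    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(i)
        # IoU of the current (largest) box with all remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # suppress heavy overlaps

    return boxes[kept], scores[kept]


def persistent_detections(frame_boxes, iou_thresh=0.5, min_hits=3):
    """Crude stand-in for the temporal fusion filter: keep a box from the
    latest frame only if a sufficiently overlapping box appears in each of
    the previous min_hits - 1 frames (spatial-location cue only)."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    if len(frame_boxes) < min_hits:
        return list(frame_boxes[-1])  # not enough history to filter yet
    latest, history = frame_boxes[-1], frame_boxes[-min_hits:-1]
    return [b for b in latest
            if all(any(iou(b, p) > iou_thresh for p in prev) for prev in history)]
```

Ranking by area instead of score is the one design choice the abstract makes explicit; in the paper it is further combined with multiple constraints, and the temporal stage additionally uses text-component features, neither of which is modeled in this sketch.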




Notes

  1. The dataset will be made publicly available soon.

  2. https://github.com/argman/EAST.


Funding

This work is supported by the National Key R&D Program of China under Contract No. 2017YFB1002203, the NSFC project under Grant 61772495, the NSFC Key Project of International (Regional) Cooperation and Exchanges under Grant 61860206004, and the Ningbo 2025 Key Project of Science and Technology Innovation under Grant No. 2018B10071.

Author information

Corresponding author

Correspondence to Weiqiang Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Cai, Y., Wang, W. Robustly detect different types of text in videos. Neural Comput & Applic 32, 12827–12840 (2020). https://doi.org/10.1007/s00521-020-04729-6
