
Robustly detect different types of text in videos

Original Article · Published in Neural Computing and Applications

Abstract

Text in videos can be categorized into three types: overlaid text, layered text, and scene text. Existing detection methods focus on a specific type of text and cannot achieve good performance on the other types. To our knowledge, few works have explored building a system that detects all types of text simultaneously. In this paper, we present a unified video text detector that can accurately localize all types of text in videos at the same time. Our system consists of a spatial text detector and a temporal fusion filter. First, we explore three different strategies for learning the spatial text detector based on deep convolutional neural networks, so that it can detect various texts simultaneously without knowing their types. Then, a new area-first non-maximum suppression scheme combined with multiple constraints is proposed to remove redundant bounding boxes. Finally, the temporal fusion filter exploits spatial-location and text-component features to integrate the detection results of consecutive frames and further remove false positives. To validate the proposed approach, comprehensive experiments are carried out on three publicly available datasets covering overlaid text, layered text, and scene text. The experimental results demonstrate that our method consistently achieves the best performance compared with state-of-the-art methods.
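To make the post-processing steps of the abstract more concrete, the sketch below illustrates two of the ideas in Python: an area-first non-maximum suppression pass and a temporal persistence check over consecutive frames. This is a minimal illustration only, assuming that "area-first" means ranking candidate boxes by area rather than by score; the function names (area_first_nms, persistent_detections), the thresholds, and the reduction of the temporal fusion filter to a pure spatial-overlap test are our own simplifications and do not reproduce the authors' implementation or their text-component features.

```python
import numpy as np


def area_first_nms(boxes, scores, iou_thresh=0.5, min_score=0.5):
    """Area-first NMS sketch: rank candidates by area (largest first) so that
    large text regions absorb smaller, highly overlapping proposals.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)

    # One simple confidence constraint: drop weak candidates up front.
    keep = scores >= min_score
    boxes, scores = boxes[keep], scores[keep]

    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = areas.argsort()[::-1]  # largest area first, not highest score

    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(i)
        # IoU of the current (largest) box with all remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # suppress heavy overlaps

    return boxes[kept], scores[kept]


def persistent_detections(frame_boxes, iou_thresh=0.5, min_hits=3):
    """Crude stand-in for the temporal fusion filter: keep a box from the
    latest frame only if a sufficiently overlapping box appears in each of
    the previous min_hits - 1 frames (spatial-location cue only)."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    if len(frame_boxes) < min_hits:
        return list(frame_boxes[-1])  # not enough history to filter yet
    latest, history = frame_boxes[-1], frame_boxes[-min_hits:-1]
    return [b for b in latest
            if all(any(iou(b, p) > iou_thresh for p in prev) for prev in history)]
```

Ranking by area instead of score is the one design choice the abstract makes explicit; in the paper it is further combined with multiple constraints, and the temporal stage additionally uses text-component features, neither of which is modeled in this sketch.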




Notes

  1. The dataset will be made publicly available soon.

  2. https://github.com/argman/EAST.


Funding

This work is supported by the National Key R&D Program of China under Contract No. 2017YFB1002203, the NSFC project under Grant 61772495, the NSFC Key Project of International (Regional) Cooperation and Exchanges under Grant 61860206004, and the Ningbo 2025 Key Project of Science and Technology Innovation under Grant No. 2018B10071.

Author information

Corresponding author

Correspondence to Weiqiang Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Cai, Y., Wang, W. Robustly detect different types of text in videos. Neural Comput & Applic 32, 12827–12840 (2020). https://doi.org/10.1007/s00521-020-04729-6
