Scene Text Detection and Recognition: The Deep Learning Era

Long, Shangbang; He, Xin; Yao, Cong

doi:10.1007/s11263-020-01369-0

Published: 27 August 2020

International Journal of Computer Vision, Volume 129, pages 161–184 (2021)

Abstract

With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has inevitably been influenced by this wave of revolution, consequently entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, methodology and performance. This survey aims to summarize and analyze the major changes and significant progress in scene text detection and recognition in the deep learning era. Through this article, we aim to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead into future trends. Specifically, we emphasize the dramatic differences brought by deep learning and the remaining grand challenges. We expect that this review paper will serve as a reference book for researchers in this field. Related resources are also collected in our GitHub repository (https://github.com/Jyouhou/SceneTextPapers).

References

Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2014). Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12), 2552–2566.
Article Google Scholar
Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5), 898–916.
Article Google Scholar
Baek, J., Kim, G., Lee, J., Park, S., Han, D., Yun, S., et al. (2019a). What is wrong with scene text recognition model comparisons? Dataset and model analysis. In Proceedings of the IEEE international conference on computer vision (pp. 4715–4723).
Baek, Y., Lee, B., Han, D., Yun, S., & Lee, H. (2019b). Character region awareness for text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 9365–9374).
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. In ICLR 2015.
Bai, F., Cheng, Z., Niu, Y., Pu, S., & Zhou, S. (2018). Edit probability for scene text recognition. In CVPR 2018.
Bartz, C., Yang, H., & Meinel, C. (2017). See: Towards semi-supervised end-to-end scene text recognition. arXiv preprint arXiv:1712.05404.
Bissacco, A., Cummins, M., Netzer, Y., & Neven, H. (2013). Photoocr: Reading text in uncontrolled conditions. In Proceedings of the IEEE international conference on computer vision (pp. 785–792).
Borisyuk, F., Gordo, A., & Sivakumar, V. (2018). Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 71–79). ACM.
Busta, M., Neumann, L., & Matas, J. (2015). Fastext: Efficient unconstrained scene text detector. In Proceedings of the IEEE international conference on computer vision (ICCV) (pp. 1206–1214).
Busta, M., Neumann, L., & Matas, J. (2017). Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In Proceedings of ICCV.
Chen, X., Yang, J., Zhang, J., & Waibel, A. (2004). Automatic detection and recognition of signs from natural scenes. IEEE Transactions on Image Processing, 13(1), 87–99.
Article Google Scholar
Cheng, Z., Bai, F., Xu, Y., Zheng, G., Pu, S., & Zhou, S. (2017a). Focusing attention: Towards accurate text recognition in natural images. In 2017 IEEE international conference on computer vision (ICCV) (pp. 5086–5094). IEEE.
Cheng, Z., Liu, X., Bai, F., Niu, Y., Pu, S., & Zhou, S. (2017b). Arbitrarily-oriented text recognition. In CVPR2018.
Ch’ng, C.K., & Chan, C. S. (2017). Total-text: A comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) (Vol. 1, pp. 935–942). IEEE.
Chowdhury, M. A., & Deb, K. (2013). Extracting and segmenting container name from container images. International Journal of Computer Applications, 74(19), 18–22.
Article Google Scholar
Coates, A., Carpenter, B., Case, C., Satheesh, S., Suresh, B., Wang, T., et al. (2011). Text detection and character recognition in scene images with unsupervised feature learning. In 2011 international conference on document analysis and recognition (ICDAR) (pp. 440–445). IEEE.
Dai, Y., Huang, Z., Gao, Y., & Chen, K. (2017). Fused text segmentation networks for multi-oriented scene text detection. arXiv preprint arXiv:1709.03272.
Dalal, N., & Triggs, B., (2005). Histograms of oriented gradients for human detection. In IEEE computer society conference on computer vision and pattern recognition (CVPR) (Vol. 1, pp. 886–893). IEEE.
Deng, D., Liu, H., Li, X., & Cai, D. (2018). Pixellink: Detecting scene text via instance segmentation. Proceedings of AAA, I, 2018.
Google Scholar
DeSouza, G. N., & Kak, A. C. (2002). Vision for mobile robot navigation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 237–267.
Article Google Scholar
Dollár, P., Appel, R., Belongie, S., & Perona, P. (2014). Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 1532–1545.
Article Google Scholar
Dvorin, Y., & Havosha, U. E. (2009). Method and device for instant translation, June 4. US Patent App. 11/998,931.
Epshtein, B., Ofek, E., & Wexler, Y. (2010). Detecting text in natural scenes with stroke width transform. In 2010 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2963–2970). IEEE.
Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.
Article Google Scholar
Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Pictorial structures for object recognition. International Journal of Computer Vision, 61(1), 55–79.
Article Google Scholar
Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A., & Berg, A. C. (2017). DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.
Gao, Y., Chen, Y., Wang, J., & Lu, H. (2017). Reading scene text with attention convolutional sequence modeling. arXiv preprint arXiv:1709.04303.
Girshick, R. (2015). Fast R-CNN. In The IEEE international conference on computer vision (ICCV).
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 580–587).
Goldberg, A. V. (1997). An efficient implementation of a scaling minimum-cost flow algorithm. Journal of Algorithms, 22(1), 1–29.
Article MathSciNet Google Scholar
Gordo, A. (2015). Supervised mid-level features for word image representation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2956–2964).
Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on machine learning (pp. 369–376). ACM.
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., & Fernández, S. (2008). Unconstrained on-line handwriting recognition with recurrent neural networks. In Advances in neural information processing systems (pp. 577–584).
Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Synthetic data for text localisation in natural images. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2315–2324).
Ham, Y. K., Kang, M. S., Chung, H. K., Park, R.-H., & Park, G. T. (1995). Recognition of raised characters for automatic classification of rubber tires. Optical Engineering, 34(1), 102–110.
Article Google Scholar
Han, J., Zhang, D., Cheng, G., Liu, N., & Xu, D. (2018). Advanced deep-learning techniques for salient and category-specific object detection: A survey. IEEE Signal Processing Magazine, 35(1), 84–100.
Article Google Scholar
He, D., Yang, X., Liang, C., Zhou, Z., Ororbia, A. G., Kifer, D., & Giles, C. L. (2017a). Multi-scale FCN with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild. In 2017 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 474–483). IEEE.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017b). Mask R-CNN. In 2017 IEEE international conference on computer vision (ICCV) (pp. 2980–2988). IEEE.
He, P., Huang, W., He, T., Zhu, Q., Qiao, Y., & Li, X. (2017c). Single shot text detector with regional attention. In The IEEE international conference on computer vision (ICCV).
He, P., Huang, W., Qiao, Y., Loy, C. C., & Tang, X. (2016). Reading scene text in deep convolutional sequences. In Thirtieth AAAI conference on artificial intelligence.
He, T., Tian, Z., Huang, W., Shen, C., Qiao, Y., & Sun, C. (2018). An end-to-end textspotter with explicit alignment and attention. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5020–5029).
He, W., Zhang, X.-Y., Yin, F., & Liu, C.-L. (2017d). Deep direct regression for multi-oriented scene text detection. In The IEEE international conference on computer vision (ICCV).
He, Z., Liu, J., Ma, H., & Li, P. (2005). A new automatic extraction method of container identity codes. IEEE Transactions on Intelligent Transportation Systems, 6(1), 72–78.
Article Google Scholar
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Article Google Scholar
Hu, H., Zhang, C., Luo, Y., Wang, Y., Han, J., & Ding, E. (2017). Wordsup: Exploiting word annotations for character based text detection. In Proceedings of the IEEE international conference on computer vision. 2017.
Huang, W., Lin, Z., Yang, J., & Wang, J. (2013). Text localization in natural images using stroke feature transform and text covariance descriptors. In Proceedings of the IEEE international conference on computer vision (pp. 1241–1248).
Huang, W., Qiao, Y., & Tang, X. (2014). Robust scene text detection with convolution neural network induced MSER trees. In European conference on computer vision (pp. 497–511). Springer.
Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014a). Deep structured output learning for unconstrained text recognition. In ICLR2015.
Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014b). Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227.
Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2016). Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1), 1–20.
Article MathSciNet Google Scholar
Jaderberg, M., Simonyan, K., Zisserman, A. et al. (2015). Spatial transformer networks. In Advances in neural information processing systems (pp. 2017–2025).
Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014c). Deep features for text spotting. In In Proceedings of European conference on computer vision (ECCV) (pp. 512–528). Springer.
Jain, A. K., & Yu, B. (1998). Automatic text location in images and video frames. Pattern Recognition, 31(12), 2055–2076.
Article Google Scholar
Jiang, Y., Zhu, X., Wang, X., Yang, S., Li, W., Wang, H., Fu, P., & Luo, Z. (2017). R2CNN: rotational region CNN for orientation robust scene text detection. arXiv preprint arXiv:1706.09579.
Jung, K., Kim, K. I., & Jain, A. K. (2004). Text information extraction in images and video: A survey. Pattern Recognition, 37(5), 977–997.
Article Google Scholar
Kang, L., Li, Y., & Doermann, D. (2014). Orientation robust text line detection in natural images. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4034–4041).
Karatzas, D., & Antonacopoulos, A. (2004). Text extraction from web images based on a split-and-merge segmentation method using colour perception. In Proceedings of the 17th international conference on pattern recognition, 2004. ICPR 2004 (Vol. 2, pp. 634–637). IEEE.
Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., et al. (2015). ICDAR 2015 competition on robust reading. In 2015 13th international conference on document analysis and recognition (ICDAR) (pp. 1156–1160). IEEE.
Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Bigorda, L. G. I., Mestre, S. R., et al. (2013). ICDAR 2013 robust reading competition. In 2013 12th international conference on document analysis and recognition (ICDAR) (pp. 1484–1493). IEEE.
Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097–1105).
Lee, C.-Y., & Osindero, S. (2016). Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2231–2239).
Lee, J.-J, Lee, P.-H., Lee, S.-W., Yuille, A., & Koch, C. (2011). Adaboost for text detection in natural scene. In 2011 international conference on document analysis and recognition (ICDAR) (pp. 429–434). IEEE.
Lee, S., & Kim, J. H. (2013). Integrating multiple character proposals for robust scene text extraction. Image and Vision Computing, 31(11), 823–840.
Article Google Scholar
Li, H., Wang, P., & Shen, C. (2017a). Towards end-to-end text spotting with convolutional recurrent neural networks. In The IEEE international conference on computer vision (ICCV).
Li, H., Wang, P., Shen, C., & Zhang, G. (2019). Show, attend and read: A simple and strong baseline for irregular text recognition. In AAAI.
Li, R., En, M., Li, J., & Zhang, H. (2017b). Weakly supervised text attention network for generating text proposals in scene images. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) (Vol. 1, pp. 324–330). IEEE.
Liao, M., Shi, B., & Bai, X. (2018a). Textboxes++: A single-shot oriented scene text detector. IEEE Transactions on Image Processing, 27(8), 3676–3690.
Article MathSciNet Google Scholar
Liao, M., Shi, B., Bai, X., Wang, X., & Liu, W. (2017). Textboxes: A fast text detector with a single deep neural network. In AAAI (pp. 4161–4167).
Liao, M., Song, B., He, M., Long, S., Yao, C., & Bai, X. (2019a). Synthtext3d: Synthesizing scene text images from 3d virtual worlds. arXiv preprint arXiv:1907.06007.
Liao, M., Zhang, J., Wan, Z., Xie, F., Liang, J., Lyu, P., Yao, C., & Bai, X. (2019b). Scene text recognition from two-dimensional perspective. In AAAI.
Liao, M., Zhu, Z., Shi, B., Xia, G.-S., & Bai, X. (2018b). Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5909–5918).
Liu, F., Shen, C., & Lin, G. (2015). Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5162–5170).
Liu, L., Ouyang, W., Wang, X., Fieguth, P., Chen, J., Liu, X., & Pietikäinen, M. (2018a). Deep learning for generic object detection: A survey. arXiv preprint arXiv:1809.02165.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016a). SSD: Single shot multibox detector. In In Proceedings of European conference on computer vision (ECCV) (pp. 21–37). Springer.
Liu, W., Chen, C., & Wong, K. (2018b). Char-net: A character-aware neural network for distorted scene text recognition. In AAAI conference on artificial intelligence, New Orleans, Louisiana, USA.
Liu, W., Chen, C., Wong, K.-Y. K., Su, Z., & Han, J. (2016b). Star-net: A spatial attention residue network for scene text recognition. In BMVC (Vol. 2, p. 7).
Liu, X. (1975). Old book of tang. Beijing: Zhonghua Book Company.
Google Scholar
Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., & Yan, J. (2018c). FOTS: Fast oriented text spotting with a unified network. In CVPR2018.
Liu, X., & Samarabandu, J. (2005a). An edge-based text region extraction algorithm for indoor mobile robot navigation. In 2005 IEEE international conference mechatronics and automation (Vol. 2, pp. 701–706). IEEE.
Liu, X., & Samarabandu, J. K. (2005b). A simple and fast text localization algorithm for indoor mobile robot navigation. In Image processing: Algorithms and systems IV (Vol. 5672, pp. 139–151). International Society for Optics and Photonics.
Liu, Y., & Jin, L. (2017). Deep matching prior network: Toward tighter multi-oriented text detection.
Liu, Y., Jin, L., Xie, Z., Luo, C., Zhang, S., & Xie, L. (2019). Tightness-aware evaluation protocol for scene text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 9612–9620).
Liu, Y., Jin, L., Zhang, S., & Zhang, S. (2017). Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170.
Liu, Z., Li, Y., Ren, F., Yu, H., & Goh, W. (2018d). Squeezedtext: A real-time scene text recognition by binary convolutional encoder–decoder network. In AAAI.
Liu, Z., Lin, G., Yang, S., Feng, J., Lin, W., & Goh, W. L. (2018e). Learning Markov clustering networks for scene text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6936–6944).
Long, S., Guan, Y., Bian, K., & Yao, C. (2020). A new perspective for flexible feature gathering in scene text recognition via character anchor pooling. In ICASSP 2020—2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2458–2462. IEEE.
Long, S., Guan, Y., Wang, B., Bian, K., & Yao, C. (2019). Alchemy: Techniques for rectification based irregular scene text recognition. arXiv preprint arXiv:1908.11834.
Long, S., Ruan, J., Zhang, W., He, X., Wu, W., & Yao, C. (2018). Textsnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of European conference on computer vision (ECCV).
Long, S., & Yao, C. (2020). Unrealtext: Synthesizing realistic scene text images from the unreal world. arXiv preprint arXiv:2003.10608.
Lyu, P., Liao, M., Yao, C., Wu, W., & Bai, X. (2018a). Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of European conference on computer vision (ECCV).
Lyu, P., Yao, C., Wu, W., Yan, S., & Bai, X. (2018b). Multi-oriented scene text detection via corner localization and region segmentation. In 2018 IEEE conference on computer vision and pattern recognition (CVPR).
Ma, J., Shao, W., Ye, H., Wang, L., Wang, H., Zheng, Y., et al. (2018). Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 20, 3111–3122.
Article Google Scholar
Mammeri, A., & Boukerche, A. et al. (2016). MSER-based text detection and communication algorithm for autonomous vehicles. In 2016 IEEE symposium on computers and communication (ISCC) (pp. 1218–1223). IEEE.
Mammeri, A., Khiari, E.-H., & Boukerche, A. (2014). Road-sign text recognition architecture for intelligent transportation systems. In 2014 IEEE 80th vehicular technology conference (VTC Fall) (pp. 1–5). IEEE.
Mishra, A., Alahari, K., & Jawahar, C. (2011). An MRF model for binarization of natural scene text. In ICDAR-international conference on document analysis and recognition. IEEE.
Mishra, A., Alahari, K., & Jawahar, C. (2012). Scene text recognition using higher order language priors. In BMVC-British machine vision conference. BMVA.
Neumann, L., & Matas, J. (2010). A method for text localization and recognition in real-world images. In Asian conference on computer vision (pp. 770–783). Springer.
Neumann, L., & Matas, J. (2012). Real-time scene text localization and recognition. In 2012 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3538–3545). IEEE.
Neumann, L., & Matas, J. (2013). On combining multiple segmentations in scene text recognition. In 2013 12th international conference on document analysis and recognition (ICDAR) (pp. 523–527). IEEE.
Nomura, S., Yamanaka, K., Katai, O., Kawakami, H., & Shiose, T. (2005). A novel adaptive morphological approach for degraded character image segmentation. Pattern Recognition, 38(11), 1961–1975.
Article Google Scholar
Parkinson, C., Jacobsen, J. J., Ferguson, D. B., & Pombo, S. A. (2016). Instant translation system, Nov. 29. US Patent 9,507,772.
Qin, S., Bissacco, A., Raptis, M., Fujii, Y., & Xiao, Y. (2019). Towards unconstrained end-to-end text spotting. In Proceedings of the IEEE international conference on computer vision (pp. 4704–4714).
Qiu, W., Zhong, F., Zhang, Y., Qiao, S., Xiao, Z., Kim, T. S., & Wang, Y. (2017). Unrealcv: Virtual worlds for computer vision. In Proceedings of the 25th ACM international conference on multimedia (pp. 1221–1224). ACM.
Phan, T. Q., Shivakumara, P., Tian, S., & Tan, C. L. (2013). Recognizing text with perspective distortion in natural scenes. In Proceedings of the IEEE international conference on computer vision (ICCV) (pp. 569–576).
Redmon, J., & Farhadi, A. (2017). Yolo9000: Better, faster, stronger. arXiv preprint.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 779–788).
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91–99).
Rodriguez-Serrano, J. A., Gordo, A., & Perronnin, F. (2015). Label embedding: A frugal baseline for text recognition. International Journal of Computer Vision, 113(3), 193–207.
Article Google Scholar
Rodriguez-Serrano, J. A., Perronnin, F., & Meylan, F. (2013). Label embedding for text recognition. In Proceedings of the British machine vision conference. Citeseer.
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. Berlin: Springer.
Google Scholar
Roy, P. P., Pal, U., Llados, J., & Delalandre, M. (2009). Multi-oriented and multi-sized touching character segmentation using dynamic programming. In 10th international conference on document analysis and recognition, 2009. IEEE.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Article MathSciNet Google Scholar
Schroth, G., Hilsenbeck, S., Huitl, R., Schweiger, F., & Steinbach, E. (2011). Exploiting text-related features for content-based image retrieval. In 2011 IEEE international symposium on multimedia (pp. 77–84). IEEE.
Schulz, R., Talbot, B., Lam, O., Dayoub, F., Corke, P., Upcroft, B., & Wyeth, G. (2015). Robot navigation using human cues: A robot navigation system for symbolic goal-directed exploration. In Proceedings of the 2015 IEEE international conference on robotics and automation (ICRA 2015) (pp. 1100–1105). IEEE.
Sheshadri, K., & Divvala, S. K. (2012). Exemplar driven character recognition in the wild. In BMVC (pp. 1–10).
Shi, B., Bai, X., & Belongie, S. (2017a). Detecting oriented text in natural images by linking segments. In The IEEE conference on computer vision and pattern recognition (CVPR).
Shi, B., Bai, X., & Yao, C. (2017b). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11), 2298–2304.
Article Google Scholar
Shi, B., Wang, X., Lyu, P., Yao, C., & Bai, X. (2016). Robust scene text recognition with automatic rectification. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4168–4176).
Shi, B., Yang, M., Wang, X., Lyu, P., Bai, X., & Yao, C. (2018). Aster: An attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11), 855–868.
Google Scholar
Shi, C., Wang, C., Xiao, B., Zhang, Y., Gao, S., & Zhang, Z. (2013). Scene text recognition using part-based tree-structured character detection. In 2013 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2961–2968). IEEE.
Shivakumara, P., Bhowmick, S., Su, B., Tan, C. L., & Pal, U. (2011). A new gradient based character segmentation method for video text recognition. In 2011 international conference on document analysis and recognition (ICDAR). IEEE.
Su, B., & Lu, S. (2014). Accurate scene text recognition based on recurrent neural network. In Asian conference on computer vision (pp. 35–48). Springer.
Sun, Y., Liu, J., Liu, W., Han, J., Ding, E., & Liu, J. (2019). Chinese street view text: Large-scale Chinese text reading with partially supervised learning. In Proceedings of the IEEE international conference on computer vision (pp. 9086–9095).
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104–3112).
Tian, S., Pan, Y., Huang, C., Lu, S., Yu, K., & Tan, C. L. (2015). Text flow: A unified text detection system in natural scene images. In Proceedings of the IEEE international conference on computer vision (pp. 4651–4659).
Tian, S., Lu, S., & Li, C. (2017). Wetext: Scene text detection under weak supervision. In Proceedings of ICCV.
Tian, Z. Huang, W., He, T., He, P., & Qiao, Y. (2016). Detecting text in natural image with connectionist text proposal network. In In Proceedings of European conference on computer vision (ECCV) (pp. 56–72). Springer.
Tian, Z., Shu, M., Lyu, P., Li, R., Zhou, C., Shen, X., & Jia, J. (2019). Learning shape-aware embedding for scene text detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4234–4243).
Tsai, S. S., Chen, H., Chen, D., Schroth, G., Grzeszczuk, R., & Girod, B. (2011). Mobile visual search on printed documents using text and low bit-rate features. In 18th IEEE international conference on image processing (ICIP) (pp. 2601–2604). IEEE.
Tu, Z., Ma, Y., Liu, W., Bai, X., & Yao, C. (2012). Detecting texts of arbitrary orientations in natural images. In 2012 IEEE conference on computer vision and pattern recognition (pp. 1083–1090). IEEE.
Uchida, S. (2014). Text localization and recognition in images and video. In Handbook of document image processing and recognition (pp. 843–883). Springer.
Wachenfeld, S., Klein, H.-U., & Jiang, X. (2006). Recognition of screen-rendered text. In 18th international conference on pattern recognition, 2006. ICPR 2006 (Vol. 2, pp. 1086–1089). IEEE.
Wakahara, T., & Kita, K. (2011). Binarization of color character strings in scene images using k-means clustering and support vector machines. In 2011 international conference on document analysis and recognition (ICDAR) (pp. 274–278). IEEE.
Wang, C., Yin, F., & Liu, C.-L. (2017). Scene text detection with novel superpixel based character candidate extraction. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR) (Vol. 1, pp. 929–934). IEEE.
Wang, F., Zhao, L., Li, X., Wang, X., & Tao, D. (2018). Geometry-aware scene text detection with instance transformation network. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1381–1389).
Wang, K., Babenko, B., & Belongie, S. (2011). End-to-end scene text recognition. In 2011 IEEE international conference on computer vision (ICCV), (pp. 1457–1464). IEEE.
Wang, T., Wu, D. J., Coates, A., & Ng, A. Y. (2012). End-to-end text recognition with convolutional neural networks. In 2012 21st international conference on pattern recognition (ICPR) (pp. 3304–3308). IEEE.
Wang, W., Xie, E., Li, X., Hou, W., Lu, T., Yu, G., & Shao, S. (2019a). Shape robust text detection with progressive scale expansion network. Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Wang, X., Jiang, Y., Luo, Z., Liu, C.-L., Choi, H., & Kim, S. (2019b). Arbitrary shape scene text detection with adaptive text region representation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6449–6458).
Weinman, J., Learned-Miller, E., & Hanson, A. (2007). Fast lexicon-based scene text recognition with sparse belief propagation. In ICDAR (pp. 979–983). IEEE.
Wolf, C., & Jolion, J.-M. (2006). Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis and Recognition (IJDAR), 8(4), 280–296.
Article Google Scholar
Wu, L., Zhang, C., Liu, J., Han, J., Liu, J., Ding, E., & Bai, X. (2019). Editing text in the wild. In Proceedings of the 27th ACM international conference on multimedia (pp. 1500–1508).
Wu, Y., & Natarajan, P. (2017). Self-organized text detection with minimal post-processing via border learning. In Proceedings of the IEEE conference on CVPR (pp. 5000–5009).
Xia, Y., Tian, F., Wu, L., Lin, J., Qin, T., Yu, N., & Liu, T.-Y. (2017). Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in neural information processing systems (pp. 1784–1794).
Xing, L., Tian, Z., Huang, W., & Scott, M. R. (2019). Convolutional character networks. In Proceedings of the IEEE international conference on computer vision (pp. 9126–9136).
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., et al. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (pp. 2048–2057).
Xue, C., Lu, S., & Zhan, F. (2018). Accurate scene text detection through border semantics awareness and bootstrapping. In In Proceedings of European conference on computer vision (ECCV).
Yang, M., Guan, Y., Liao, M., He, X., Bian, K., Bai, S., et al. (2019). Symmetry-constrained rectification network for scene text recognition. In Proceedings of the IEEE international conference on computer vision (pp. 9147–9156).
Yang, Q., Jin, H., Huang, J., & Lin, W. (2020). Swaptext: Image based texts transfer in scenes. arXiv preprint arXiv:2003.08152.
Yang, X., He, D., Zhou, Z., Kifer, D., & Giles, C. L. (2017). Learning to read irregular text with attention mechanisms. In Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI-17 (pp. 3280–3286).
Yao, C., Bai, X., Shi, B., & Liu, W. (2014). Strokelets: A learned multi-scale representation for scene text recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4042–4049).
Yao, C., Bai, X., Sang, N., Zhou, X., Zhou, S., & Cao, Z. (2016). Scene text detection via holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002.
Ye, Q., & Doermann, D. (2015). Text detection and recognition in imagery: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(7), 1480–1500.
Ye, Q., Gao, W., Wang, W., & Zeng, W. (2003). A robust text detection algorithm in images and video frames. In IEEE ICICS-PCM (pp. 802–806).
Yi, C., & Tian, Y. (2011). Text string detection from natural scenes by structure-based partition and grouping. IEEE Transactions on Image Processing, 20(9), 2594–2605.
Yin, F., Wu, Y.-C., Zhang, X.-Y., & Liu, C.-L. (2017). Scene text recognition with sliding convolutional character models. arXiv preprint arXiv:1709.01727.
Yin, X.-C., Yin, X., Huang, K., & Hao, H.-W. (2014). Robust text detection in natural scene images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 970–983.
Yin, X.-C., Zuo, Z.-Y., Tian, S., & Liu, C.-L. (2016). Text detection, tracking and recognition in video: A comprehensive survey. IEEE Transactions on Image Processing, 25(6), 2752–2773.
Yu, D., Li, X., Zhang, C., Han, J., Liu, J., & Ding, E. (2020). Towards accurate scene text recognition with semantic reasoning networks. arXiv preprint arXiv:2003.12294.
Yuan, T.-L., Zhu, Z., Xu, K., Li, C.-J., & Hu, S.-M. (2018). Chinese text in the wild. arXiv preprint arXiv:1803.00085.
Zhan, F., & Lu, S. (2019). ESIR: End-to-end scene text recognition via iterative image rectification. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Zhan, F., Lu, S., & Xue, C. (2018). Verisimilar image synthesis for accurate detection and recognition of texts in scenes.
Zhang, C., Liang, B., Huang, Z., En, M., Han, J., Ding, E., & Ding, X. (2019). Look more than once: An accurate detector for text of arbitrary shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Zhang, D., & Chang, S.-F. (2003). A Bayesian framework for fusing multiple word knowledge models in videotext recognition. In Computer vision and pattern recognition, 2003. IEEE.
Zhang, S., Liu, Y., Jin, L., & Luo, C. (2018). Feature enhancement network: A refined scene text detector. In Proceedings of AAAI.
Zhang, S.-X., Zhu, X., Hou, J.-B., Liu, C., Yang, C., Wang, H., & Yin, X.-C. (2020). Deep relational reasoning graph network for arbitrary shape text detection. arXiv preprint arXiv:2003.07493.
Zhang, Z., Zhang, C., Shen, W., Yao, C., Liu, W., & Bai, X. (2016). Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
Zhiwei, Z., Linlin, L., & Lim, T. C. (2010). Edge based binarization for video text images. In 2010 20th international conference on pattern recognition (ICPR) (pp. 133–136). IEEE.
Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., & Liang, J. (2017). EAST: An efficient and accurate scene text detector. In The IEEE conference on computer vision and pattern recognition (CVPR).
Zhu, Y., Yao, C., & Bai, X. (2016). Scene text detection and recognition: Recent advances and future trends. Frontiers of Computer Science, 10(1), 19–36.
Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In Proceedings of European conference on computer vision (ECCV) (pp. 391–405). Springer.

Author information

Authors and Affiliations

Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
Shangbang Long
ByteDance Ltd, Beijing, China
Xin He
MEGVII Inc. (Face++), Beijing, China
Cong Yao

Corresponding author

Correspondence to Shangbang Long.

Additional information

Communicated by Vittorio Ferrari.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Long, S., He, X. & Yao, C. Scene Text Detection and Recognition: The Deep Learning Era. Int J Comput Vis 129, 161–184 (2021). https://doi.org/10.1007/s11263-020-01369-0

Received: 14 April 2020
Accepted: 08 August 2020
Published: 27 August 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s11263-020-01369-0