Review of Scene Text Detection and Recognition

  • Han Lin
  • Peng Yang
  • Fanlong Zhang
Original Paper

Abstract

Scene text carries rich semantic information that can be used in many vision-based applications, so detecting and recognizing scene text has received increasing attention in recent years. In this paper, we first introduce the history and progress of scene text detection and recognition, classify conventional methods in detail, and point out their advantages and disadvantages. We then study these methods and illustrate the corresponding key issues and techniques, including loss functions, multi-orientation, language models and sequence labeling. Finally, we describe commonly used benchmark datasets and evaluation protocols, based on which the performance of representative scene text detection and recognition methods is analyzed and compared.

Notes

Funding

This work is supported in part by a sub-project of the National Key Research and Development Program (2017YFC0804002) and by the National Natural Science Foundation of China (61662048, 61772277, 71771125 and 61603192).

Compliance with Ethical Standards

Conflict of interest

The authors declare that they have no conflict of interest.

Copyright information

© CIMNE, Barcelona, Spain 2019

Authors and Affiliations

  1. School of Information Engineering, Nanjing Audit University, Jiangsu, China
  2. School of Information Engineering, Nanchang Hangkong University, Jiangxi, China
