RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition

  • Conference paper
Computer Vision – ECCV 2020 (ECCV 2020)

Abstract

The attention-based encoder-decoder framework has recently achieved impressive results for scene text recognition, and many variants have emerged with improvements in recognition quality. However, it performs poorly on contextless texts (e.g., random character sequences), which is unacceptable in most real application scenarios. In this paper, we first investigate the decoding process of the decoder in depth. We empirically find that a representative character-level sequence decoder utilizes not only contextual information but also positional information. Contextual information, on which existing approaches rely heavily, causes the problem of attention drift. To suppress this side effect, we propose a novel position enhancement branch and dynamically fuse its outputs with those of the decoder attention module for scene text recognition. Specifically, the branch contains a position-aware module that enables the encoder to output feature vectors encoding their own spatial positions, and an attention module that estimates glimpses using only the positional clue (i.e., the current decoding time step). The dynamic fusion is performed via an element-wise gate mechanism to obtain more robust features. Theoretically, our proposed method, dubbed RobustScanner, decodes individual characters with a dynamic ratio between contextual and positional clues, relying more on positional clues when decoding sequences with scarce context, and is thus robust and practical. Empirically, it achieves new state-of-the-art results on popular regular and irregular text recognition benchmarks without much performance drop on contextless benchmarks, validating its robustness in both contextual and contextless application scenarios.
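The element-wise gated fusion mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the exact formula, so the form below (a sigmoid gate computed from the concatenated glimpses, used as a per-dimension mixing ratio) is an assumption chosen for illustration, not the authors' implementation. The names `gated_fusion`, `context_glimpse`, `position_glimpse`, `weights`, and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(context_glimpse, position_glimpse, weights, bias):
    """Fuse two equally sized glimpse vectors with an element-wise gate.

    Assumed form (not taken from the paper):
        gate_i  = sigmoid(weights[i] . [context; position] + bias[i])
        fused_i = gate_i * context_i + (1 - gate_i) * position_i
    so each output dimension has its own context/position ratio.
    """
    if len(context_glimpse) != len(position_glimpse):
        raise ValueError("glimpse vectors must have the same length")
    # Concatenate both glimpses; the gate sees all of this as input.
    joint = list(context_glimpse) + list(position_glimpse)
    fused = []
    for i, (c, p) in enumerate(zip(context_glimpse, position_glimpse)):
        logit = sum(w * x for w, x in zip(weights[i], joint)) + bias[i]
        gate = sigmoid(logit)
        fused.append(gate * c + (1.0 - gate) * p)
    return fused


# Toy usage: two 2-dimensional glimpses and a gate with zero weights.
context = [1.0, 2.0]
position = [3.0, 6.0]
zero_weights = [[0.0] * 4 for _ in range(2)]
balanced = gated_fusion(context, position, zero_weights, [0.0, 0.0])
```

With zero weights and zero bias every gate equals 0.5, so the result is the plain average of the two glimpses. A strongly positive bias pushes a dimension toward the contextual glimpse and a strongly negative one toward the positional glimpse, which is the kind of per-dimension ratio the abstract describes.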

Corresponding author

Correspondence to Zhanghui Kuang.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Yue, X., Kuang, Z., Lin, C., Sun, H., Zhang, W. (2020). RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12364. Springer, Cham. https://doi.org/10.1007/978-3-030-58529-7_9

  • DOI: https://doi.org/10.1007/978-3-030-58529-7_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58528-0

  • Online ISBN: 978-3-030-58529-7

  • eBook Packages: Computer Science (R0)
