
Pure Transformer with Integrated Experts for Scene Text Recognition

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Scene text recognition (STR) is the task of reading text in cropped images of natural scenes. Conventional STR models employ a convolutional neural network (CNN) followed by a recurrent neural network in an encoder-decoder framework. Recently, the transformer architecture has been widely adopted in STR for its strong ability to capture the long-term dependencies that are prominent in scene text images. Many researchers use the transformer as part of a hybrid CNN-transformer encoder, often followed by a transformer decoder. However, such methods exploit long-term dependencies only midway through the encoding process. Although the vision transformer (ViT) can capture such dependencies at an early stage, it remains largely unexploited in STR. This work proposes a transformer-only model as a simple baseline that outperforms hybrid CNN-transformer models. Furthermore, two key areas for improvement are identified. First, the first decoded character has the lowest prediction accuracy. Second, images of different original aspect ratios respond differently to the patch resolution, yet ViT employs only one fixed patch resolution. To address these areas, Pure Transformer with Integrated Experts (PTIE) is proposed: a transformer model that can process multiple patch resolutions and decode in both the original and reversed character orders. It is evaluated on 7 commonly used benchmarks and compared with over 20 state-of-the-art methods. The experimental results show that the proposed method outperforms them and obtains state-of-the-art results on most benchmarks.
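The two ideas the abstract highlights, tokenizing one image at multiple patch resolutions and preparing decoding targets in both character orders, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the 32x128 input size and the 8x4 / 4x8 patch shapes are illustrative assumptions.

```python
import numpy as np

def patchify(img, ph, pw):
    """Split an (H, W, C) image into non-overlapping ph x pw patches,
    each flattened into a vector, giving a (num_patches, ph*pw*C) sequence."""
    H, W, C = img.shape
    assert H % ph == 0 and W % pw == 0, "patch shape must tile the image"
    patches = img.reshape(H // ph, ph, W // pw, pw, C)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group the two patch-grid axes
    return patches.reshape(-1, ph * pw * C)

# A 32x128 grayscale crop, a common input size in STR pipelines.
img = np.random.rand(32, 128, 1)

# Two hypothetical patch resolutions: tall-narrow vs. short-wide patches.
tokens_a = patchify(img, 8, 4)  # 128 tokens of dimension 32
tokens_b = patchify(img, 4, 8)  # 128 tokens of dimension 32

# Bidirectional decoding targets: original and reversed character orders,
# so a second decoding pass sees the last character first.
label = "STREET"
fwd_target = list(label)        # ['S', 'T', 'R', 'E', 'E', 'T']
bwd_target = list(label)[::-1]  # ['T', 'E', 'E', 'R', 'T', 'S']
```

Both token sequences have the same length and dimension here, which is what lets a single shared transformer (with the appropriate embeddings) process either resolution.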



Acknowledgments

This work is partially supported by NTU Internal Funding - Accelerating Creativity and Excellence (NTU-ACE2020-03).

Author information


Corresponding author

Correspondence to Yew Lee Tan.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1605 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tan, Y.L., Kong, A.W.K., Kim, J.J. (2022). Pure Transformer with Integrated Experts for Scene Text Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13688. Springer, Cham. https://doi.org/10.1007/978-3-031-19815-1_28


  • DOI: https://doi.org/10.1007/978-3-031-19815-1_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19814-4

  • Online ISBN: 978-3-031-19815-1

  • eBook Packages: Computer Science (R0)
