
Pure Transformer with Integrated Experts for Scene Text Recognition

  • Conference paper
  • First Online:
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Scene text recognition (STR) is the task of reading text in cropped images of natural scenes. Conventional STR models employ a convolutional neural network (CNN) followed by a recurrent neural network in an encoder-decoder framework. Recently, the transformer architecture has been widely adopted in STR for its strong ability to capture the long-term dependencies that are prominent in scene text images. Many researchers use the transformer as part of a hybrid CNN-transformer encoder, often followed by a transformer decoder. However, such methods exploit long-term dependencies only midway through the encoding process. Although the vision transformer (ViT) can capture such dependencies at an early stage, it remains largely unexploited in STR. This work proposes a transformer-only model as a simple baseline that outperforms hybrid CNN-transformer models. Furthermore, two key areas for improvement are identified. First, the first decoded character has the lowest prediction accuracy. Second, images of different original aspect ratios respond differently to the patch resolution, yet ViT employs only one fixed patch resolution. To address these areas, Pure Transformer with Integrated Experts (PTIE) is proposed: a transformer model that can process multiple patch resolutions and decode in both the original and reversed character orders. It is evaluated on 7 commonly used benchmarks and compared with over 20 state-of-the-art methods. The experimental results show that the proposed method outperforms them and obtains state-of-the-art results on most benchmarks.
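The two ideas the abstract highlights, tokenizing one image at multiple patch resolutions and preparing decoding targets in both character orders, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the 32x128 input size and the 8x4 / 4x8 patch shapes are illustrative assumptions.

```python
import numpy as np

def patchify(img, ph, pw):
    """Split an (H, W, C) image into non-overlapping ph x pw patches,
    each flattened into a vector, giving a (num_patches, ph*pw*C) sequence."""
    H, W, C = img.shape
    assert H % ph == 0 and W % pw == 0, "patch shape must tile the image"
    patches = img.reshape(H // ph, ph, W // pw, pw, C)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group the two patch-grid axes
    return patches.reshape(-1, ph * pw * C)

# A 32x128 grayscale crop, a common input size in STR pipelines.
img = np.random.rand(32, 128, 1)

# Two hypothetical patch resolutions: tall-narrow vs. short-wide patches.
tokens_a = patchify(img, 8, 4)  # 128 tokens of dimension 32
tokens_b = patchify(img, 4, 8)  # 128 tokens of dimension 32

# Bidirectional decoding targets: original and reversed character orders,
# so a second decoding pass sees the last character first.
label = "STREET"
fwd_target = list(label)        # ['S', 'T', 'R', 'E', 'E', 'T']
bwd_target = list(label)[::-1]  # ['T', 'E', 'E', 'R', 'T', 'S']
```

Both token sequences have the same length and dimension here, which is what lets a single shared transformer (with the appropriate embeddings) process either resolution.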



Acknowledgments

This work is partially supported by NTU Internal Funding - Accelerating Creativity and Excellence (NTU-ACE2020-03).

Author information


Corresponding author

Correspondence to Yew Lee Tan.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1605 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tan, Y.L., Kong, A.W.K., Kim, J.J. (2022). Pure Transformer with Integrated Experts for Scene Text Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13688. Springer, Cham. https://doi.org/10.1007/978-3-031-19815-1_28


  • DOI: https://doi.org/10.1007/978-3-031-19815-1_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19814-4

  • Online ISBN: 978-3-031-19815-1

  • eBook Packages: Computer Science (R0)
