Abstract
Scene text recognition (STR) is the task of reading text in cropped images of natural scenes. Conventional STR models employ a convolutional neural network (CNN) followed by a recurrent neural network in an encoder-decoder framework. Recently, the transformer architecture has been widely adopted in STR owing to its strong capability in capturing the long-term dependencies that are prominent in scene text images. Many researchers use the transformer as part of a hybrid CNN-transformer encoder, often followed by a transformer decoder. However, such methods only exploit long-term dependencies midway through the encoding process. Although the vision transformer (ViT) can capture such dependencies at an early stage, it remains largely unexploited in STR. This work proposes a transformer-only model as a simple baseline that outperforms hybrid CNN-transformer models. Furthermore, two key areas for improvement are identified. First, the first decoded character has the lowest prediction accuracy. Second, images of different original aspect ratios react differently to the patch resolution, while ViT employs only one fixed patch resolution. To address these areas, the Pure Transformer with Integrated Experts (PTIE) is proposed: a transformer model that can process multiple patch resolutions and decode in both the original and reversed character orders. PTIE is evaluated on 7 commonly used benchmarks and compared with over 20 state-of-the-art methods. The experimental results show that the proposed method outperforms them and achieves state-of-the-art results on most benchmarks.
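The multi-resolution idea above can be illustrated with a minimal patch-embedding sketch: a ViT-style model flattens an image into a sequence of patches, and the same crop can be tokenized at more than one patch resolution. This is only an illustration of the concept; the input size (32x128) and patch sizes (4x8 and 8x4) are assumed for the example and may differ from the paper's actual configuration.

```python
import numpy as np

def patchify(img, ph, pw):
    """Split an (H, W, C) image into flattened non-overlapping ph x pw patches."""
    H, W, C = img.shape
    assert H % ph == 0 and W % pw == 0, "image must tile evenly into patches"
    # Reshape into a grid of patches, then flatten each patch into one token.
    patches = img.reshape(H // ph, ph, W // pw, pw, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, ph * pw * C)
    return patches

# A scene-text crop, resized to a fixed height and width (assumed 32x128 here).
img = np.random.rand(32, 128, 3)

# Two different patch resolutions over the same image yield two token
# sequences; a multi-resolution model can learn from both.
tokens_a = patchify(img, 4, 8)   # 8 x 16 grid -> 128 tokens of dim 96
tokens_b = patchify(img, 8, 4)   # 4 x 32 grid -> 128 tokens of dim 96
```

With these choices both tokenizations happen to produce the same sequence length and token dimension, which makes it straightforward to feed them through shared transformer weights.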
Acknowledgments
This work is partially supported by NTU Internal Funding - Accelerating Creativity and Excellence (NTU-ACE2020-03).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Tan, Y.L., Kong, A.W.K., Kim, J.J. (2022). Pure Transformer with Integrated Experts for Scene Text Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol. 13688. Springer, Cham. https://doi.org/10.1007/978-3-031-19815-1_28
Print ISBN: 978-3-031-19814-4
Online ISBN: 978-3-031-19815-1