
Vision Transformer for Fast and Efficient Scene Text Recognition

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12821)

Included in the following conference series: Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Abstract

Scene text recognition (STR) enables computers to read text in natural scenes such as object labels, road signs, and instructions. STR helps machines make informed decisions, such as which object to pick, which direction to go, and what the next step of action is. In the body of work on STR, the focus has always been on recognition accuracy; little emphasis has been placed on speed and computational efficiency, which are equally important, especially for energy-constrained mobile machines. In this paper, we propose ViTSTR, a scene text recognizer with a simple single-stage model architecture built on a compute- and parameter-efficient vision transformer (ViT). Compared with a strong baseline method such as TRBA, with an accuracy of 84.3%, our small ViTSTR achieves a competitive accuracy of 82.6% (84.2% with data augmentation) at \(2.4\times \) the speed, using only 43.4% of the parameters and 42.2% of the FLOPS. The tiny version of ViTSTR achieves 80.3% accuracy (82.1% with data augmentation) at \(2.5\times \) the speed, requiring only 10.9% of the parameters and 11.9% of the FLOPS. With data augmentation, our base ViTSTR outperforms TRBA at 85.2% accuracy (83.7% without augmentation) at \(2.3\times \) the speed, but requires 73.2% more parameters and 61.5% more FLOPS. In terms of trade-offs, nearly all ViTSTR configurations are at or near the frontier that maximizes accuracy, speed, and computational efficiency at the same time.
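The single-stage design described in the abstract is straightforward to express in code. Below is a minimal PyTorch sketch of a ViT-style scene text recognizer in the spirit of ViTSTR: a patch-embedding stem, a stack of standard transformer encoder blocks, and a linear head that reads character predictions directly off the first few output tokens. All module names and hyperparameters here (e.g. `MiniViTSTR`, `max_text_len`, the 96-class alphabet) are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a single-stage ViT-based scene text recognizer in the
# spirit of ViTSTR. Names and hyperparameters are illustrative assumptions,
# not the paper's exact implementation.
import torch
import torch.nn as nn


class MiniViTSTR(nn.Module):
    def __init__(self, img_size=(224, 224), patch=16, dim=192, depth=12,
                 heads=3, num_classes=96, max_text_len=25):
        super().__init__()
        n_patches = (img_size[0] // patch) * (img_size[1] // patch)
        # Patch embedding: split the image into patch x patch tiles and
        # project each tile to a `dim`-dimensional token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        # Standard pre-norm transformer encoder with GELU, as in ViT.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Single-stage head: the first `max_text_len` output tokens are
        # projected directly to character classes. There is no separate
        # rectification, recurrent, or attention-decoder stage.
        self.max_text_len = max_text_len
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.encoder(tokens + self.pos_embed)
        return self.head(tokens[:, :self.max_text_len])  # (B, T, classes)


if __name__ == "__main__":
    model = MiniViTSTR()
    logits = model(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 25, 96])
```

Because the backbone is a plain ViT encoder, moving between tiny, small, and base variants amounts to changing the embedding dimension, depth, and head count of this one encoder; eliminating the extra rectification, recurrent, and decoder stages found in multi-stage pipelines such as TRBA is where the speed and parameter savings come from.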




Acknowledgements

This work was funded by the University of the Philippines ECWRG 2019–2020. GPU machines were supported by the CHED-PCARI AIRSCAN Project and Samsung R&D PH. Special thanks to the people of the Computer Networks Laboratory: Roel Ocampo, Vladimir Zurbano, Lope Beltran II, and John Robert Mendoza, who worked tirelessly during the pandemic to keep our network and servers continuously running.

Author information

Correspondence to Rowel Atienza.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Atienza, R. (2021). Vision Transformer for Fast and Efficient Scene Text Recognition. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12821. Springer, Cham. https://doi.org/10.1007/978-3-030-86549-8_21


  • DOI: https://doi.org/10.1007/978-3-030-86549-8_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86548-1

  • Online ISBN: 978-3-030-86549-8

  • eBook Packages: Computer Science, Computer Science (R0)
