
Vision Transformer for Fast and Efficient Scene Text Recognition

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12821)

Included in the following conference series: Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Abstract

Scene text recognition (STR) enables computers to read text in natural scenes such as object labels, road signs, and instructions. STR helps machines make informed decisions, such as which object to pick, which direction to go, and what the next step of action is. In the body of work on STR, the focus has always been on recognition accuracy; little emphasis has been placed on speed and computational efficiency, which are equally important, especially for energy-constrained mobile machines. In this paper, we propose ViTSTR, a scene text recognizer with a simple single-stage model architecture built on a compute- and parameter-efficient vision transformer (ViT). Compared with a strong baseline method such as TRBA, with an accuracy of 84.3%, our small ViTSTR achieves a competitive accuracy of 82.6% (84.2% with data augmentation) at \(2.4\times \) the speed, using only 43.4% of the parameters and 42.2% of the FLOPS. The tiny version of ViTSTR achieves 80.3% accuracy (82.1% with data augmentation) at \(2.5\times \) the speed, requiring only 10.9% of the parameters and 11.9% of the FLOPS. With data augmentation, our base ViTSTR outperforms TRBA at 85.2% accuracy (83.7% without augmentation) at \(2.3\times \) the speed, but requires 73.2% more parameters and 61.5% more FLOPS. In terms of trade-offs, nearly all ViTSTR configurations are at or near the frontier that maximizes accuracy, speed, and computational efficiency at the same time.
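The single-stage design described in the abstract is straightforward to express in code. Below is a minimal PyTorch sketch of a ViT-style scene text recognizer in the spirit of ViTSTR: a patch-embedding stem, a stack of standard transformer encoder blocks, and a linear head that reads character predictions directly off the first few output tokens. All module names and hyperparameters here (e.g. `MiniViTSTR`, `max_text_len`, the 96-class alphabet) are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a single-stage ViT-based scene text recognizer in the
# spirit of ViTSTR. Names and hyperparameters are illustrative assumptions,
# not the paper's exact implementation.
import torch
import torch.nn as nn


class MiniViTSTR(nn.Module):
    def __init__(self, img_size=(224, 224), patch=16, dim=192, depth=12,
                 heads=3, num_classes=96, max_text_len=25):
        super().__init__()
        n_patches = (img_size[0] // patch) * (img_size[1] // patch)
        # Patch embedding: split the image into patch x patch tiles and
        # project each tile to a `dim`-dimensional token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        # Standard pre-norm transformer encoder with GELU, as in ViT.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Single-stage head: the first `max_text_len` output tokens are
        # projected directly to character classes. There is no separate
        # rectification, recurrent, or attention-decoder stage.
        self.max_text_len = max_text_len
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.encoder(tokens + self.pos_embed)
        return self.head(tokens[:, :self.max_text_len])  # (B, T, classes)


if __name__ == "__main__":
    model = MiniViTSTR()
    logits = model(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 25, 96])
```

Because the backbone is a plain ViT encoder, moving between tiny, small, and base variants amounts to changing the embedding dimension, depth, and head count of this one encoder; eliminating the extra rectification, recurrent, and decoder stages found in multi-stage pipelines such as TRBA is where the speed and parameter savings come from.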




Acknowledgements

This work was funded by the University of the Philippines ECWRG 2019–2020. GPU machines were supported by the CHED-PCARI AIRSCAN Project and Samsung R&D PH. Special thanks to the people of the Computer Networks Laboratory: Roel Ocampo, Vladimir Zurbano, Lope Beltran II, and John Robert Mendoza, who worked tirelessly during the pandemic to keep our network and servers continuously running.

Author information

Correspondence to Rowel Atienza.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Atienza, R. (2021). Vision Transformer for Fast and Efficient Scene Text Recognition. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12821. Springer, Cham. https://doi.org/10.1007/978-3-030-86549-8_21


  • DOI: https://doi.org/10.1007/978-3-030-86549-8_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86548-1

  • Online ISBN: 978-3-030-86549-8

  • eBook Packages: Computer Science, Computer Science (R0)
