Context-Free TextSpotter for Real-Time and Mobile End-to-End Text Detection and Recognition

Part of the Lecture Notes in Computer Science book series (LNIP, volume 12822)

Abstract

When deploying scene-text spotting systems on mobile platforms, lightweight models with low computational cost are preferable. In concept, end-to-end (E2E) text spotting suits this purpose because it performs text detection and recognition in a single model. However, current state-of-the-art E2E methods rely on heavy feature extractors, recurrent sequence modeling, and complex shape aligners to pursue accuracy, so their computational cost remains high. We explore the opposite direction: how far can we go without bells and whistles in E2E text spotting? To this end, we propose Context-Free TextSpotter, a text-spotting method that consists of simple convolutions and a few post-processing steps. Experiments on standard benchmarks show that Context-Free TextSpotter achieves real-time text spotting on a GPU with only three million parameters, making it the smallest and fastest among existing deep text spotters, with an acceptable degradation in transcription quality compared to heavier ones. Further, we demonstrate that our text spotter can run on a smartphone with affordable latency, which is valuable for building stand-alone OCR applications.
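
To give a feel for what a budget of three million parameters allows, the following minimal sketch counts the learnable parameters of a stack of plain convolutions. The `Conv2d` shape record, the `toy_stack` layer sizes, and the `BUDGET` constant are assumptions made up for illustration; they are not the architecture described in the paper, and only the three-million figure comes from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conv2d:
    """Shape of one 2-D convolution layer (hypothetical, for illustration)."""
    in_channels: int
    out_channels: int
    kernel_size: int
    bias: bool = True

    def num_params(self) -> int:
        # One k x k filter per (input channel, output channel) pair,
        # plus one bias term per output channel when bias is enabled.
        weights = self.kernel_size ** 2 * self.in_channels * self.out_channels
        return weights + (self.out_channels if self.bias else 0)


def total_params(layers) -> int:
    """Sum the learnable parameters over a stack of layers."""
    return sum(layer.num_params() for layer in layers)


# A made-up stack of plain 3x3 convolutions; NOT the paper's architecture.
toy_stack = [
    Conv2d(3, 32, 3),
    Conv2d(32, 64, 3),
    Conv2d(64, 128, 3),
    Conv2d(128, 256, 3),
    Conv2d(256, 256, 3),
]

BUDGET = 3_000_000  # "three million parameters", taken from the abstract

if __name__ == "__main__":
    n = total_params(toy_stack)
    print(f"toy stack: {n:,} parameters, within budget: {n <= BUDGET}")
```

Because the count grows with the product of input and output channels, wide late layers dominate the total, which is one reason a design built from simple convolutions can stay small if channel widths are kept modest.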

Keywords

  • Scene text spotting
  • Mobile text recognition
  • Scene text detection and recognition

Notes

  1. Note that Fig. 3 does not precisely reflect the structure due to layout constraints.

  2. https://github.com/jiangxiluning/FOTS.PyTorch

Acknowledgements

We would like to thank Katsushi Yamashita, Daeju Kim, and members of the AI Strategy Office in SoftBank for helpful discussion.

Author information

Corresponding author

Correspondence to Ryota Yoshihashi.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Yoshihashi, R., Tanaka, T., Doi, K., Fujino, T., Yamashita, N. (2021). Context-Free TextSpotter for Real-Time and Mobile End-to-End Text Detection and Recognition. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12822. Springer, Cham. https://doi.org/10.1007/978-3-030-86331-9_16

  • DOI: https://doi.org/10.1007/978-3-030-86331-9_16

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86330-2

  • Online ISBN: 978-3-030-86331-9

  • eBook Packages: Computer Science, Computer Science (R0)