Context-Free TextSpotter for Real-Time and Mobile End-to-End Text Detection and Recognition

Yoshihashi, Ryota; Tanaka, Tomohiro; Doi, Kenji; Fujino, Takumi; Yamashita, Naoaki

doi:10.1007/978-3-030-86331-9_16

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12822)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

In the deployment of scene-text spotting systems on mobile platforms, lightweight models with low computation are preferable. In concept, end-to-end (E2E) text spotting is suitable for such purposes because it performs text detection and recognition in a single model. However, current state-of-the-art E2E methods rely on heavy feature extractors, recurrent sequence modeling, and complex shape aligners to pursue accuracy, so their computational cost remains high. We explore the opposite direction: how far can we go without bells and whistles in E2E text spotting? To this end, we propose Context-Free TextSpotter, a text-spotting method that consists of simple convolutions and a few post-processing steps. Experiments on standard benchmarks show that Context-Free TextSpotter achieves real-time text spotting on a GPU with only three million parameters, making it the smallest and fastest among existing deep text spotters, with an acceptable degradation in transcription quality compared to heavier ones. Further, we demonstrate that our text spotter can run on a smartphone with affordable latency, which is valuable for building stand-alone OCR applications.

Notes

1. Note that Fig. 3 does not precisely reflect the structure due to layout constraints.
2. https://github.com/jiangxiluning/FOTS.PyTorch

Acknowledgements

We would like to thank Katsushi Yamashita, Daeju Kim, and members of the AI Strategy Office in SoftBank for helpful discussion.

Author information

Authors and Affiliations

Yahoo Japan Corporation, Tokyo, Japan
Ryota Yoshihashi, Tomohiro Tanaka, Kenji Doi, Takumi Fujino & Naoaki Yamashita

Corresponding author

Correspondence to Ryota Yoshihashi.

Editor information

Editors and Affiliations

Universitat Autònoma de Barcelona, Barcelona, Spain
Josep Lladós
Lehigh University, Bethlehem, PA, USA
Daniel Lopresti
Kyushu University, Fukuoka-shi, Japan
Seiichi Uchida

About this paper

Cite this paper

Yoshihashi, R., Tanaka, T., Doi, K., Fujino, T., Yamashita, N. (2021). Context-Free TextSpotter for Real-Time and Mobile End-to-End Text Detection and Recognition. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12822. Springer, Cham. https://doi.org/10.1007/978-3-030-86331-9_16

DOI: https://doi.org/10.1007/978-3-030-86331-9_16
Published: 02 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86330-2
Online ISBN: 978-3-030-86331-9
eBook Packages: Computer Science, Computer Science (R0)
