A Cost-Efficient Framework for Scene Text Detection in the Wild

Part of the Lecture Notes in Computer Science book series (LNAI, volume 13031)

Abstract

Scene text detection in the wild is an active research topic in computer vision and has achieved great progress with the aid of deep learning. However, training deep text detection models requires large amounts of annotations, such as bounding boxes and quadrangles, which are laborious and expensive to obtain. Although synthetic data is easier to acquire, a model trained on it suffers a large performance gap compared with one trained on real data because of the domain shift. To address this problem, we propose a novel two-stage framework for cost-efficient scene text detection. Specifically, to unleash the power of synthetic data, we design an unsupervised domain adaptation scheme, consisting of Entropy-aware Global Transfer (EGT) and Text Region Transfer (TRT), to pre-train the model. Furthermore, we fine-tune the model with a minimal number of actively annotated and enhanced pseudo-labeled real samples, aiming to save annotation cost. In this framework, both the diversity of the synthetic data and the realism of the unlabeled real data are fully exploited. Extensive experiments on various benchmarks show that the proposed framework significantly outperforms the baseline and achieves desirable performance with only a few labeled real samples.
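
As a purely illustrative aid (the paper's implementation is not reproduced on this page), the following Python sketch outlines one standard way to realize the two stages described above: adversarial feature alignment through a gradient reversal layer for the synthetic-to-real pre-training stage, weighted by discriminator entropy in the spirit of EGT, plus an uncertainty-based selection rule for the active annotation stage. Every name here (GradReverse, DomainDiscriminator, entropy_weight, global_transfer_loss, select_for_annotation) is a hypothetical stand-in under assumed design details, not the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the
    backward pass, so the feature extractor learns to fool the domain
    discriminator."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class DomainDiscriminator(nn.Module):
    """Predicts, per spatial location, whether a feature map comes from
    synthetic (label 0) or real (label 1) images."""

    def __init__(self, in_channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 3, padding=1),
        )

    def forward(self, feat, lambd=1.0):
        return self.net(GradReverse.apply(feat, lambd))


def entropy_weight(domain_logits):
    """Entropy-aware weighting: locations the discriminator is unsure about
    receive larger weights, focusing alignment on ambiguous regions."""
    p = torch.sigmoid(domain_logits)
    ent = -(p * torch.log(p + 1e-8) + (1 - p) * torch.log(1 - p + 1e-8))
    return 1.0 + ent


def global_transfer_loss(disc, syn_feat, real_feat):
    """Adversarial loss aligning global feature distributions between the
    synthetic and real domains."""
    syn_logits = disc(syn_feat)
    real_logits = disc(real_feat)
    loss_syn = F.binary_cross_entropy_with_logits(
        syn_logits, torch.zeros_like(syn_logits),
        weight=entropy_weight(syn_logits.detach()))
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits),
        weight=entropy_weight(real_logits.detach()))
    return 0.5 * (loss_syn + loss_real)


def select_for_annotation(confidences, budget):
    """Active learning step: pick the `budget` images whose detection
    confidence is closest to 0.5 (most uncertain) for human annotation;
    the remaining confident predictions can serve as pseudo labels."""
    uncertainty = -(confidences - 0.5).abs()
    return torch.topk(uncertainty, budget).indices

In such a pipeline, global_transfer_loss would be added to the detection loss while pre-training on synthetic plus unlabeled real images, and select_for_annotation would then split the real set into a small actively labeled pool and a pseudo-labeled remainder for fine-tuning.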

Keywords

  • Scene text detection
  • Unsupervised domain adaptation
  • Semi-supervised active learning

Acknowledgments

This work is supported by the Open Research Project of the State Key Laboratory of Media Convergence and Communication, Communication University of China (No. SKLMCC2020KF004), the Beijing Municipal Science & Technology Commission (No. Z191100007119002), the Key Research Program of Frontier Sciences, CAS (Grant No. ZDBS-LY-7024), and the National Natural Science Foundation of China (No. 62006221).

Author information

Corresponding author

Correspondence to Yu Zhou.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Zeng, G., Zhang, Y., Zhou, Y., Yang, X. (2021). A Cost-Efficient Framework for Scene Text Detection in the Wild. In: Pham, D.N., Theeramunkong, T., Governatori, G., Liu, F. (eds) PRICAI 2021: Trends in Artificial Intelligence. PRICAI 2021. Lecture Notes in Computer Science, vol 13031. Springer, Cham. https://doi.org/10.1007/978-3-030-89188-6_11

  • DOI: https://doi.org/10.1007/978-3-030-89188-6_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89187-9

  • Online ISBN: 978-3-030-89188-6

  • eBook Packages: Computer Science, Computer Science (R0)