Separating Content from Style Using Adversarial Learning for Recognizing Text in the Wild

Published in: International Journal of Computer Vision

Abstract

Scene text recognition is an important task in computer vision. Despite the tremendous progress of the past few years, challenges such as varying font styles, arbitrary shapes, and complex backgrounds still make the problem difficult. In this work, we propose to improve text recognition from a new perspective by separating the text content from the complex background, thus making recognition considerably easier and significantly improving accuracy. To this end, we exploit generative adversarial networks (GANs) to remove backgrounds while retaining the text content. As vanilla GANs are not sufficiently robust to generate sequence-like characters in natural images, we propose an adversarial learning framework for the generation and recognition of multiple characters in an image. The proposed framework consists of an attention-based recognizer and a generative adversarial architecture. Furthermore, to tackle the lack of paired training samples, we design an interactive joint training scheme that shares attention masks from the recognizer with the discriminator, enabling the discriminator to extract the features of each character for adversarial training. Benefiting from this character-level adversarial training, our framework requires only unpaired simple data for style supervision: each target style sample, containing a single randomly chosen character, can be synthesized online during training. This is significant because training requires neither costly paired samples nor character-level annotations; only the input images and their corresponding text labels are needed. In addition to normalizing the style of the backgrounds, we refine character patterns to ease the recognition task. A feedback mechanism is proposed to bridge the gap between the discriminator and the recognizer, so that the discriminator can guide the generator according to the confusion of the recognizer and the generated patterns become clearer to recognize. Experiments on various benchmarks, including both regular and irregular text, demonstrate that our method significantly reduces the difficulty of recognition. Our framework can be integrated into recent recognition methods to achieve new state-of-the-art accuracy.
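
To make the character-level adversarial training concrete, below is a minimal PyTorch-style sketch of the idea: the generator re-renders the input on a clean background, and the discriminator pools its convolutional features with the recognizer's per-character attention masks, so each decoded character receives its own real/fake score. All module shapes, layer choices, and names here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Image-to-image network meant to re-render text on a clean background."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

class CharacterDiscriminator(nn.Module):
    """Pools convolutional features with per-character attention masks,
    so adversarial scores are produced character by character."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.score = nn.Linear(32, 1)

    def forward(self, image, attn_masks):
        # image: (B, 3, H, W); attn_masks: (B, T, H, W), one soft mask per
        # decoded character, shared by the attention-based recognizer.
        feat = self.features(image)                        # (B, C, H, W)
        pooled = torch.einsum("bchw,bthw->btc", feat, attn_masks)
        return self.score(pooled)                          # (B, T, 1)

gen = TinyGenerator()
disc = CharacterDiscriminator()
crops = torch.randn(2, 3, 32, 100)   # input scene-text crops
masks = torch.rand(2, 5, 32, 100)    # attention masks for 5 decoded characters
clean = gen(crops)                   # background-free rendering
scores = disc(clean, masks)          # per-character adversarial signal
```

Because the discriminator judges one masked character at a time, the target style samples it compares against need only contain a single character each, which is what makes unpaired, online-synthesized supervision feasible.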



Notes

  1. https://fonts.google.com

  2. https://pillow.readthedocs.io/en/stable/reference/ImageDraw.html

  3. The official implementation is available at https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix
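
As a concrete illustration of the online style-sample synthesis described in the abstract, here is a minimal sketch using Pillow's ImageDraw (note 2) with a Google Fonts typeface (note 1). The font path, canvas size, and character set are illustrative assumptions, not the paper's exact settings.

```python
import random
import string

from PIL import Image, ImageDraw, ImageFont

def synthesize_style_sample(font_path="Roboto-Regular.ttf", size=(64, 64)):
    """Render one randomly chosen character on a plain white canvas."""
    char = random.choice(string.ascii_letters + string.digits)
    image = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(font_path, size=48)
    # Center the glyph using its bounding box.
    left, top, right, bottom = draw.textbbox((0, 0), char, font=font)
    x = (size[0] - (right - left)) / 2 - left
    y = (size[1] - (bottom - top)) / 2 - top
    draw.text((x, y), char, fill="black", font=font)
    return image, char

sample, label = synthesize_style_sample()
sample.save(f"style_{label}.png")
```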


Acknowledgements

This research was supported in part by NSFC (Grant No. 61936003), GD-NSF (No. 2017A030312006), the National Key Research and Development Program of China (No. 2016YFB1001405), and the Fundamental Research Funds for the Central Universities (x2dxD2190570).

Author information


Corresponding author

Correspondence to Lianwen Jin.

Additional information

Communicated by Cha Zhang, Ph.D.



About this article


Cite this article

Luo, C., Lin, Q., Liu, Y. et al. Separating Content from Style Using Adversarial Learning for Recognizing Text in the Wild. Int J Comput Vis 129, 960–976 (2021). https://doi.org/10.1007/s11263-020-01411-1

