Separating Content from Style Using Adversarial Learning for Recognizing Text in the Wild

Published in: International Journal of Computer Vision

Abstract

Scene text recognition is an important task in computer vision. Despite the tremendous progress of the past few years, challenges such as varying font styles, arbitrary shapes, and complex backgrounds still make the problem difficult. In this work, we propose to improve text recognition from a new perspective by separating the text content from the complex background, thus making recognition considerably easier and significantly improving accuracy. To this end, we exploit generative adversarial networks (GANs) to remove backgrounds while retaining the text content. As vanilla GANs are not sufficiently robust to generate sequence-like characters in natural images, we propose an adversarial learning framework for the generation and recognition of multiple characters in an image. The proposed framework consists of an attention-based recognizer and a generative adversarial architecture. Furthermore, to tackle the lack of paired training samples, we design an interactive joint training scheme that shares attention masks from the recognizer with the discriminator, enabling the discriminator to extract the features of each character for adversarial training. Benefiting from this character-level adversarial training, our framework requires only unpaired simple data for style supervision: each target style sample, containing a single randomly chosen character, can be synthesized online during training. This is significant because training requires neither costly paired samples nor character-level annotations; only the input images and their corresponding text labels are needed. In addition to normalizing the style of the backgrounds, we refine character patterns to ease the recognition task. A feedback mechanism is proposed to bridge the gap between the discriminator and the recognizer, so that the discriminator can guide the generator according to the confusion of the recognizer and the generated patterns become clearer to recognize. Experiments on various benchmarks, including both regular and irregular text, demonstrate that our method significantly reduces the difficulty of recognition. Our framework can be integrated into recent recognition methods to achieve new state-of-the-art accuracy.
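
To make the character-level adversarial training concrete, below is a minimal PyTorch-style sketch of the idea: the generator re-renders the input on a clean background, and the discriminator pools its convolutional features with the recognizer's per-character attention masks, so each decoded character receives its own real/fake score. All module shapes, layer choices, and names here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Image-to-image network meant to re-render text on a clean background."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

class CharacterDiscriminator(nn.Module):
    """Pools convolutional features with per-character attention masks,
    so adversarial scores are produced character by character."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.score = nn.Linear(32, 1)

    def forward(self, image, attn_masks):
        # image: (B, 3, H, W); attn_masks: (B, T, H, W), one soft mask per
        # decoded character, shared by the attention-based recognizer.
        feat = self.features(image)                        # (B, C, H, W)
        pooled = torch.einsum("bchw,bthw->btc", feat, attn_masks)
        return self.score(pooled)                          # (B, T, 1)

gen = TinyGenerator()
disc = CharacterDiscriminator()
crops = torch.randn(2, 3, 32, 100)   # input scene-text crops
masks = torch.rand(2, 5, 32, 100)    # attention masks for 5 decoded characters
clean = gen(crops)                   # background-free rendering
scores = disc(clean, masks)          # per-character adversarial signal
```

Because the discriminator judges one masked character at a time, the target style samples it compares against need only contain a single character each, which is what makes unpaired, online-synthesized supervision feasible.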



Notes

  1. https://fonts.google.com

  2. https://pillow.readthedocs.io/en/stable/reference/ImageDraw.html

  3. The official implementation is available at https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix
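
As a concrete illustration of the online style-sample synthesis described in the abstract, here is a minimal sketch using Pillow's ImageDraw (note 2) with a Google Fonts typeface (note 1). The font path, canvas size, and character set are illustrative assumptions, not the paper's exact settings.

```python
import random
import string

from PIL import Image, ImageDraw, ImageFont

def synthesize_style_sample(font_path="Roboto-Regular.ttf", size=(64, 64)):
    """Render one randomly chosen character on a plain white canvas."""
    char = random.choice(string.ascii_letters + string.digits)
    image = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(font_path, size=48)
    # Center the glyph using its bounding box.
    left, top, right, bottom = draw.textbbox((0, 0), char, font=font)
    x = (size[0] - (right - left)) / 2 - left
    y = (size[1] - (bottom - top)) / 2 - top
    draw.text((x, y), char, fill="black", font=font)
    return image, char

sample, label = synthesize_style_sample()
sample.save(f"style_{label}.png")
```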


Acknowledgements

This research was supported in part by NSFC (Grant No. 61936003), GD-NSF (No. 2017A030312006), the National Key Research and Development Program of China (No. 2016YFB1001405), and the Fundamental Research Funds for the Central Universities (x2dxD2190570).

Author information


Corresponding author

Correspondence to Lianwen Jin.

Additional information

Communicated by Cha Zhang, Ph.D.



About this article


Cite this article

Luo, C., Lin, Q., Liu, Y. et al. Separating Content from Style Using Adversarial Learning for Recognizing Text in the Wild. Int J Comput Vis 129, 960–976 (2021). https://doi.org/10.1007/s11263-020-01411-1

