Synthesizing data for text recognition with style transfer

  • Jiahui Li
  • Siwei Wang
  • Yongtao WangEmail author
  • Zhi Tang


Most of the existing datasets for scene text recognition merely consist of a few thousand training samples with a very limited vocabulary, which cannot meet the requirement of the state-of-the-art deep learning based text recognition methods. Meanwhile, although the synthetic datasets (e.g., SynthText90k) usually contain millions of samples, they cannot fit the data distribution of the small target datasets in natural scenes completely. To address these problems, we propose a word data generating method called SynthText-Transfer, which is capable of emulating the distribution of the target dataset. SynthText-Transfer uses a style transfer method to generate samples with arbitray text content, which preserve the texture of the reference sample in the target dataset. The generated images are not only visibly similar with real images, but also capable of improving the accuracy of the state-of-the-art text recognition methods, especially for the English and Chinese dataset with a large alphabet (in which many characters only appear in few samples, making it hard to learn for sequence models). Moreover, the proposed method is fast and flexible, with a competitive speed among common style transfer methods.


Synthetic data Text recognition Style transfer Data augmentation 



This work is supported by National Natural Science Foundation of China under Grant 61673029. This work is also a research achievement of Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology).


  1. 1.
    Chen T Q, Schmidt M (2016) Fast patch-based style transfer of arbitrary style. arXiv:1612.04337
  2. 2.
    Deng J, Dong W, Socher R, Li L J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database Computer vision and pattern recognition, 2009. CVPR 2009. IEEE conference on, pp. 248–255. IeeeGoogle Scholar
  3. 3.
    Duck S Y (2016) Painter by numbers.
  4. 4.
    Gatys L, Ecker A, Bethge M (2015) A neural algorithm of artistic style. Nature communicationsGoogle Scholar
  5. 5.
    Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680Google Scholar
  6. 6.
    Graves A, Fernández S, Gomez F, Schmidhuber J (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference on machine learning. ACM, pp 369–376Google Scholar
  7. 7.
    Gupta A, Vedaldi A, Zisserman A (2016) Synthetic data for text localisation in natural images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2315–2324Google Scholar
  8. 8.
    He P, Huang W, Qiao Y, Loy C C, Tang X (2016) Reading scene text in deep convolutional sequences. In: AAAI, vol 16, pp 3501–3508Google Scholar
  9. 9.
    Huang X, Belongie S J (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV, pp 1510–1519Google Scholar
  10. 10.
    Jaderberg M, Simonyan K, Vedaldi A, Zisserman A (2014) Synthetic data and artificial neural networks for natural scene text recognition. arXiv:1406.2227
  11. 11.
    Karatzas D, Shafait F, Uchida S, Iwamura M, i Bigorda L G, Mestre S R, Mas J, Mota D F, Almazan J A, De Las Heras L P (2013) Icdar 2013 robust reading competition. In: Document analysis and recognition (ICDAR), 2013 12th international conference on. IEEE, pp 1484–1493Google Scholar
  12. 12.
    Lang K (1995) Newsweeder: Learning to filter netnews. In: Machine learning proceedings 1995. Elsevier, pp 331–339Google Scholar
  13. 13.
    Li C, Wand M (2016) Combining markov random fields and convolutional neural networks for image synthesis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2479–2486Google Scholar
  14. 14.
    Li Y, Fang C, Yang J, Wang Z, Lu X, Yang M H (2017) Universal style transfer via feature transforms. In: Advances in neural information processing systems, pp 386–396Google Scholar
  15. 15.
    Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C L (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, pp 740–755Google Scholar
  16. 16.
    Liu X, Liang D, Yan S, Chen D, Qiao Y, Yan J (2018) Fots: fast oriented text spotting with a unified network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5676–5685Google Scholar
  17. 17.
    Lucas S M, Panaretos A, Sosa L, Tang A, Wong S, Young R, Ashida K, Nagai H, Okamoto M, Yamamoto H et al (2005) Icdar 2003 robust reading competitions: entries, results, and future directions. Int J Doc Anal Recogn (IJDAR) 7 (2-3):105–122CrossRefGoogle Scholar
  18. 18.
    Mishra A, Alahari K, Jawahar C (2012) Scene text recognition using higher order language priors. In: BMVC-British machine vision conference. BMVAGoogle Scholar
  19. 19.
    Ravi S, Larochelle H (2016) Optimization as a model for few-shot learningGoogle Scholar
  20. 20.
    Risser E, Wilmot P, Barnes C (2017) Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv:1701.08893
  21. 21.
    Shi B, Bai X, Yao C (2017) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence 39(11):2298–2304CrossRefGoogle Scholar
  22. 22.
    Shi B, Wang X, Lyu P, Yao C, Bai X (2016) Robust scene text recognition with automatic rectification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4168–4176Google Scholar
  23. 23.
    Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: ICLRGoogle Scholar
  24. 24.
    Smith R, Gu C, Lee D S, Hu H, Unnikrishnan R, Ibarz J, Arnoud S, Lin S (2016) End-to-end interpretation of the french street name signs dataset. In: European conference on computer vision. Springer, pp 411–426Google Scholar
  25. 25.
    Sun C, Shrivastava A, Singh S, Gupta A (2017) Revisiting unreasonable effectiveness of data in deep learning era. In: Computer vision (ICCV), 2017 IEEE international conference on. IEEE, pp 843–852Google Scholar
  26. 26.
    Wang K, Babenko B, Belongie S (2011) End-to-end scene text recognition. In: Computer vision (ICCV), 2011 IEEE international conference on. IEEE, pp 1457–1464Google Scholar
  27. 27.
    Wang Y, Bai X, Liu C-L (2018) Icpr mtwi 2018 challenge 1 text recognition of web images.
  28. 28.
    Yeh R A, Chen C, Lim T Y, Schwing A G, Hasegawa-Johnson M, Do M N (2017) Semantic image inpainting with deep generative models. In: CVPR, vol 2, p 4Google Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. 1.Institute of Computer Science and TechnologyPeking UniversityBeijingChina

Personalised recommendations