Advertisement

Synthetically Supervised Feature Learning for Scene Text Recognition

  • Yang LiuEmail author
  • Zhaowen Wang
  • Hailin Jin
  • Ian Wassell
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11209)

Abstract

We address the problem of image feature learning for scene text recognition. The image features in the state-of-the-art methods are learned from large-scale synthetic image datasets. However, most methods only rely on outputs of the synthetic data generation process, namely realistically looking images, and completely ignore the rest of the process. We propose to leverage the parameters that lead to the output images to improve image feature learning. Specifically, for every image out of the data generation process, we obtain the associated parameters and render another “clean” image that is free of select distortion factors that are applied to the output image. Because of the absence of distortion factors, the clean image tends to be easier to recognize than the original image which can serve as supervision. We design a multi-task network with an encoder-discriminator-generator architecture to guide the feature of the original image toward that of the clean image. The experiments show that our method significantly outperforms the state-of-the-art methods on standard scene text recognition benchmarks in the lexicon-free category. Furthermore, we show that without explicit handling, our method works on challenging cases where input images contain severe geometric distortion, such as text on a curved path.

Keywords

Scene text recognition Deep learning Neural networks Feature learning Synthetic data Multi-task learning 

References

  1. 1.
    Almazán, J., Gordo, A., Fornés, A., Valveny, E.: Word spotting and recognition with embedded attributes. IEEE Trans. Pattern Anal. Mach. Intell. 36(12), 2552–2566 (2014)CrossRefGoogle Scholar
  2. 2.
    Berthelot, D., Schumm, T., Metz, L.: Began: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717 (2017)
  3. 3.
    Bissacco, A., Cummins, M., Netzer, Y., Neven, H.: PhotoOCR: reading text in uncontrolled conditions. In: 2013 IEEE International Conference on Computer Vision (ICCV), pp. 785–792. IEEE (2013)Google Scholar
  4. 4.
    Cheng, Z., Bai, F., Xu, Y., Zheng, G., Pu, S., Zhou, S.: Focusing attention: towards accurate text recognition in natural images. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5086–5094. IEEE (2017)Google Scholar
  5. 5.
    Ghosh, S.K., Valveny, E., Bagdanov, A.D.: Visual attention models for scene text recognition. In: ICDAR (2017)Google Scholar
  6. 6.
    Goel, V., Mishra, A., Alahari, K., Jawahar, C.: Whole is greater than sum of parts: recognizing scene text words. In: 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 398–402. IEEE (2013)Google Scholar
  7. 7.
    Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)Google Scholar
  8. 8.
    Gordo, A.: Supervised mid-level features for word image representation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2956–2964 (2015)Google Scholar
  9. 9.
    Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. ACM (2006)Google Scholar
  10. 10.
    Graves, A., Mohamed, A.R., Hinton, G.: Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649. IEEE (2013)Google Scholar
  11. 11.
    Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)CrossRefGoogle Scholar
  12. 12.
  13. 13.
    Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arXiv preprint (2017)Google Scholar
  14. 14.
    Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. In: NIPS Deep Learning Workshop (2014)Google Scholar
  15. 15.
    Jaderberg, M., Vedaldi, A., Zisserman, A.: Deep features for text spotting. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 512–528. Springer, Cham (2014).  https://doi.org/10.1007/978-3-319-10593-2_34CrossRefGoogle Scholar
  16. 16.
    Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Deep structured output learning for unconstrained text recognition. In: International Conference on Learning Representations (2015)Google Scholar
  17. 17.
    Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild with convolutional neural networks. Int. J. Comput. Vis. 116(1), 1–20 (2016)MathSciNetCrossRefGoogle Scholar
  18. 18.
    Karatzas, D., et al.: ICDAR 2013 robust reading competition. In: 2013 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 1484–1493. IEEE (2013)Google Scholar
  19. 19.
    Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)Google Scholar
  20. 20.
    Lee, C.Y., Osindero, S.: Recursive recurrent nets with attention modeling for ocr in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2231–2239 (2016)Google Scholar
  21. 21.
    Lucas, S.M., et al.: ICDAR 2003 robust reading competitions: entries, results, and future directions. Int. J. Doc. Anal. Recognit. (IJDAR) 7(2–3), 105–122 (2005)CrossRefGoogle Scholar
  22. 22.
    Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
  23. 23.
    Mishra, A., Alahari, K., Jawahar, C.: Scene text recognition using higher order language priors. In: BMVC 2012–23rd British Machine Vision Conference. BMVA (2012)Google Scholar
  24. 24.
    Peng, X., Yu, X., Sohn, K., Metaxas, D.N., Chandraker, M.: Reconstruction-based disentanglement for pose-invariant face recognition. Intervals 20, 12 (2017)Google Scholar
  25. 25.
    Phan, T.Q., Shivakumara, P., Tian, S., Tan, C.L.: Recognizing text with perspective distortion in natural scenes. In: 2013 IEEE International Conference on Computer Vision (ICCV), pp. 569–576. IEEE (2013)Google Scholar
  26. 26.
    Risnumawan, A., Shivakumara, P., Chan, C.S., Tan, C.L.: A robust arbitrary text detection system for natural scene images. Expert. Syst. Appl. 41(18), 8027–8048 (2014)CrossRefGoogle Scholar
  27. 27.
    Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(11), 2298–2304 (2017)CrossRefGoogle Scholar
  28. 28.
    Shi, B., Wang, X., Lyu, P., Yao, C., Bai, X.: Robust scene text recognition with automatic rectification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4168–4176 (2016)Google Scholar
  29. 29.
    Su, B., Lu, S.: Accurate scene text recognition based on recurrent neural network. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9003, pp. 35–48. Springer, Cham (2015).  https://doi.org/10.1007/978-3-319-16865-4_3CrossRefGoogle Scholar
  30. 30.
    Tran, L., Yin, X., Liu, X.: Disentangled representation learning GAN for pose-invariant face recognition. In: CVPR (2017)Google Scholar
  31. 31.
    Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 1457–1464. IEEE (2011)Google Scholar
  32. 32.
    Wang, T., Wu, D.J., Coates, A., Ng, A.Y.: End-to-end text recognition with convolutional neural networks. In: ICPR (2012)Google Scholar
  33. 33.
    Yao, C., Bai, X., Shi, B., Liu, W.: Strokelets: a learned multi-scale representation for scene text recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4042–4049 (2014)Google Scholar
  34. 34.
    Ye, Q., Doermann, D.: Text detection and recognition in imagery: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 37(7), 1480–1500 (2015)CrossRefGoogle Scholar
  35. 35.
    Yim, J., Jung, H., Yoo, B., Choi, C., Park, D., Kim, J.: Rotating your face using multi-task deep neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 676–684 (2015)Google Scholar
  36. 36.
    Yin, X.C., Zuo, Z.Y., Tian, S., Liu, C.L.: Text detection, tracking and recognition in video: a comprehensive survey. IEEE Trans. Image Process. 25(6), 2752–2773 (2016)MathSciNetCrossRefGoogle Scholar
  37. 37.
    Zhou, X., et al.: East: an efficient and accurate scene text detector. arXiv preprint arXiv:1704.03155 (2017)
  38. 38.
    Zhu, Y., Yao, C., Bai, X.: Scene text detection and recognition: recent advances and future trends. Front. Comput. Sci. 10, 19–36 (2016)CrossRefGoogle Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Yang Liu
    • 1
    Email author
  • Zhaowen Wang
    • 2
  • Hailin Jin
    • 2
  • Ian Wassell
    • 1
  1. 1.Computer LaboratoryUniversity of CambridgeCambridgeUK
  2. 2.Adobe ResearchSan JoseUSA

Personalised recommendations