
Large Scale Scene Text Verification with Guided Attention

  • Dafang He
  • Yeqing Li
  • Alexander Gorban
  • Derrall Heath
  • Julian Ibarz
  • Qian Yu
  • Daniel Kifer
  • C. Lee Giles
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11365)

Abstract

Many tasks reduce to determining whether a particular text string appears in an image. In this work, we propose a new framework that learns this task end to end. The framework takes an image and a text string as input and outputs the probability that the text string is present in the image. This is the first end-to-end framework in the scene text domain that learns such relationships between text and images. The framework requires no explicit scene text detection or recognition, and therefore no bounding-box annotations are needed. It is also the first work in the scene text domain to tackle such a weakly labeled problem. Based on this framework, we developed a model called Guided Attention. Our model achieves better results than several state-of-the-art solutions built on scene text reading on a challenging Street View Business Matching task. The task is to find the correct business names for storefront images, and the dataset we collected for it is substantially larger and more challenging than existing scene text datasets. This new real-world task provides a new perspective for studying scene-text-related problems.
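
The abstract fixes only the model's interface: an image and a candidate string go in, a single probability that the string is visible comes out, and the binary present/absent label is the only supervision. Purely as an illustration of that interface, and not of the authors' Guided Attention architecture, the following is a minimal PyTorch-style sketch; every module choice, dimension, and the attention wiring below are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TextImageVerifier(nn.Module):
    """Hypothetical sketch: scores whether a text string appears in an image.

    This is NOT the paper's Guided Attention model; it only mirrors the
    interface in the abstract (image + text string -> probability).
    """

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        # Small convolutional image encoder producing a grid of visual features.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Character-level encoder for the query string.
        self.char_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Text-conditioned attention over image feature locations.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        # Binary classifier: probability that the string is present.
        self.classifier = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, images, char_ids):
        # images: (B, 3, H, W); char_ids: (B, T) integer-encoded characters.
        feats = self.image_encoder(images)                   # (B, C, H', W')
        feats = feats.flatten(2).transpose(1, 2)             # (B, H'*W', C)
        emb = self.char_embed(char_ids)                      # (B, T, E)
        _, (h_n, _) = self.text_encoder(emb)
        query = h_n[-1].unsqueeze(1)                         # (B, 1, C) text summary
        attended, _ = self.attn(query, feats, feats)         # attend to image regions
        return self.classifier(attended.squeeze(1)).squeeze(-1)  # (B,) probabilities

# Example usage (random tensors, for shape checking only):
# model = TextImageVerifier(vocab_size=100)
# probs = model(torch.randn(2, 3, 128, 256), torch.randint(0, 100, (2, 12)))
```

Because only the scalar present/absent label supervises the whole pipeline, a model of this shape needs no bounding-box annotations, which matches the weakly labeled setting described in the abstract.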

Keywords

Scene text · Verification · End-to-end model · Attention · Weakly labeled dataset

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Dafang He (1)
  • Yeqing Li (2)
  • Alexander Gorban (2)
  • Derrall Heath (2)
  • Julian Ibarz (2)
  • Qian Yu (2)
  • Daniel Kifer (1)
  • C. Lee Giles (1)

  1. The Pennsylvania State University, University Park, USA
  2. Google Inc., Mountain View, USA
