Adaptive Text Recognition Through Visual Matching

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12361)

Abstract

This work addresses the problems of generalization and flexibility for text recognition in documents. We introduce a new model that exploits the repetitive nature of characters in languages, and decouples the visual decoding and linguistic modelling stages through intermediate representations in the form of similarity maps. By doing this, we turn text recognition into a visual matching problem, thereby achieving generalization in appearance and flexibility in classes.

We evaluate the model on both synthetic and real datasets across different languages and alphabets, and show that it can handle challenges that traditional architectures are unable to solve without expensive re-training: (i) the number of classes can be changed simply by swapping the exemplars; and (ii) the model generalizes to novel languages and characters (not in the training data) simply by providing a new glyph exemplar set. In essence, it is able to carry out one-shot sequence recognition. We also demonstrate that the model generalizes to unseen fonts without requiring new exemplars from them.
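
To make the matching formulation concrete, the sketch below illustrates the kind of intermediate similarity-map representation the abstract describes. It is a minimal sketch under stated assumptions: a shared visual encoder is presumed to have already produced per-column features for the text-line image and for a strip of concatenated glyph exemplars, and the cosine similarity, the mean-pooling decoder, and the function names (`similarity_map`, `decode_scores`) are illustrative stand-ins rather than the paper's learned decoding modules.

```python
import torch
import torch.nn.functional as F

def similarity_map(line_feats, glyph_feats):
    """Cosine-similarity map between line-image and glyph-exemplar features.

    line_feats:  (W_line, D)   per-column features of the text-line image
    glyph_feats: (W_glyphs, D) per-column features of the exemplar strip
    returns:     (W_glyphs, W_line) similarity map
    """
    line = F.normalize(line_feats, dim=-1)
    glyphs = F.normalize(glyph_feats, dim=-1)
    return glyphs @ line.t()

def decode_scores(sim_map, glyph_widths):
    """Pool similarity rows into per-class scores for each line column.

    glyph_widths gives the number of feature columns each exemplar
    occupies, so rows of the map can be aggregated per character class.
    (Mean-pooling is an illustrative choice, not the paper's method.)
    """
    scores, start = [], 0
    for w in glyph_widths:
        scores.append(sim_map[start:start + w].mean(dim=0))
        start += w
    return torch.stack(scores)  # (num_classes, W_line)

# Toy usage: 5 glyph classes, each 4 feature-columns wide, 64-dim features.
line_feats = torch.randn(100, 64)
glyph_feats = torch.randn(20, 64)
sim = similarity_map(line_feats, glyph_feats)
per_class = decode_scores(sim, [4, 4, 4, 4, 4])
prediction = per_class.argmax(dim=0)  # predicted class index per line column
```

The flexibility claimed above falls directly out of this formulation: recognizing a different alphabet only requires encoding a new exemplar strip and passing its glyph widths to the same functions, with no retraining.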

Code, data, and model checkpoints are available at: http://www.robots.ox.ac.uk/~vgg/research/FontAdaptor20/.

Keywords

Text recognition · Sequence recognition · Similarity maps

Acknowledgements

This research is funded by a Google-DeepMind Graduate Scholarship and the EPSRC Programme Grant Seebibyte EP/M013774/1. We would like to thank Triantafyllos Afouras, Weidi Xie, Yang Liu and Erika Lu for discussions and proof-reading.

Supplementary material

Supplementary material 1: 504471_1_En_4_MOESM1_ESM.pdf (PDF, 14 MB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Visual Geometry Group, Department of Engineering Science, University of Oxford, Oxford, UK
  2. DeepMind, London, UK
