
TextMatcher: Cross-Attentional Neural Network to Compare Image and Text

  • Conference paper
  • Included in the conference series: Discovery Science (DS 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13601)

Abstract

We study a multimodal-learning problem where, given an image containing a single line of (printed or handwritten) text and a candidate text transcription, the goal is to assess whether the text represented in the image corresponds to the candidate text. This problem, which we dub text matching, is primarily motivated by a real industrial application scenario of automated cheque processing, whose goal is to automatically assess whether the information in a bank cheque (e.g., the issue date) matches the data entered by the customer while depositing the cheque at an automated teller machine (ATM). The problem also finds more general application in several other scenarios, e.g., personal-identity-document processing in user-registration procedures.

We devise a machine-learning model specifically designed for the text-matching problem. The proposed model, termed TextMatcher, compares the two inputs by applying a novel cross-attention mechanism over the embedding representations of image and text, and it is trained in an end-to-end fashion on the desired distribution of errors to be detected. We demonstrate the effectiveness of TextMatcher on the automated-cheque-processing use case, where it generalizes well to future, unseen dates, unlike existing models designed for related problems. We further assess the performance of TextMatcher on different distributions of errors on the public IAM dataset. Results show that, compared to a naïve model and existing models for related problems, TextMatcher achieves higher performance across a variety of configurations.
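To make the idea concrete, the following is a minimal, hypothetical sketch of how a cross-attentional image-text matcher of this kind could be wired up in PyTorch. It is not the authors' implementation: the module names, encoder architecture, dimensions, and character vocabulary below are illustrative assumptions. The only elements taken from the abstract are the overall pattern of cross-attending between image and text embeddings and end-to-end training with a binary matching objective.

    # Illustrative sketch only, not the authors' implementation: a minimal
    # cross-attention matcher in PyTorch. Module names, dimensions, and the
    # character vocabulary are assumptions made for this example.
    import torch
    import torch.nn as nn

    class CrossAttentionMatcher(nn.Module):
        def __init__(self, vocab_size=96, d_model=128, n_heads=4, n_cols=32):
            super().__init__()
            # Text branch: embed each character of the candidate transcription.
            self.char_emb = nn.Embedding(vocab_size, d_model)
            # Image branch: a toy CNN that turns the text-line image into a
            # sequence of column features (stand-in for a real visual encoder).
            self.cnn = nn.Sequential(
                nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, n_cols)),   # -> (B, d_model, 1, n_cols)
            )
            # Cross-attention: text tokens act as queries over the image features.
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Binary head: does the image match the candidate text?
            self.classifier = nn.Linear(d_model, 1)

        def forward(self, image, char_ids):
            # image: (B, 1, H, W) grayscale text-line image
            # char_ids: (B, L) integer-encoded candidate transcription
            img_seq = self.cnn(image).squeeze(2).transpose(1, 2)   # (B, n_cols, d_model)
            txt_seq = self.char_emb(char_ids)                      # (B, L, d_model)
            attended, _ = self.cross_attn(txt_seq, img_seq, img_seq)
            return self.classifier(attended.mean(dim=1)).squeeze(-1)  # (B,) match logit

    # End-to-end training on matching (1) and non-matching (0) image/text pairs:
    model = CrossAttentionMatcher()
    logits = model(torch.randn(2, 1, 32, 160), torch.randint(0, 96, (2, 12)))
    loss = nn.functional.binary_cross_entropy_with_logits(logits, torch.tensor([1.0, 0.0]))

In a setting like the cheque-processing use case described above, training "on the desired distribution of errors" would presumably mean constructing the non-matching pairs so that they reflect the error types the system should detect (e.g., plausible but wrong dates), rather than sampling arbitrary mismatched text.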


Notes

  1. TextMatcher has been deployed at UniCredit, and it is currently used in production.

  2. https://github.com/ayumiymk/aster.pytorch.


Author information

Correspondence to Valentina Arrigoni.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Arrigoni, V., Repele, L., Saccavino, D.M. (2022). TextMatcher: Cross-Attentional Neural Network to Compare Image and Text. In: Poncelet, P., Ienco, D. (eds.) Discovery Science. DS 2022. Lecture Notes in Computer Science, vol. 13601. Springer, Cham. https://doi.org/10.1007/978-3-031-18840-4_25

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-18840-4_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-18839-8

  • Online ISBN: 978-3-031-18840-4

  • eBook Packages: Computer Science, Computer Science (R0)
