Abstract
We study a multimodal-learning problem where, given an image containing a single line of (printed or handwritten) text and a candidate transcription, the goal is to assess whether the text depicted in the image corresponds to the candidate text. This problem, which we dub text matching, is primarily motivated by a real industrial application: automated cheque processing, whose goal is to automatically assess whether the information on a bank cheque (e.g., the issue date) matches the data entered by the customer while depositing the cheque at an automated teller machine (ATM). The problem also finds more general application in several other scenarios, e.g., personal-identity-document processing in user-registration procedures.
We devise a machine-learning model specifically designed for the text-matching problem. The proposed model, termed TextMatcher, compares the two inputs by applying a novel cross-attention mechanism over the embedding representations of the image and the text, and it is trained end-to-end on the desired distribution of errors to be detected. We demonstrate the effectiveness of TextMatcher on the automated-cheque-processing use case, where it generalizes well to future unseen dates, unlike existing models designed for related problems. We further assess its performance under different error distributions on the public IAM dataset. Results attest that, compared to a naïve baseline and existing models for related problems, TextMatcher achieves higher performance across a variety of configurations.
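The abstract describes comparing an image and a candidate text via cross-attention over their embedding representations. The paper's exact architecture is not reproduced here; as a rough illustrative sketch only, the following NumPy snippet shows one generic way such a comparison can be wired: character embeddings act as queries attending over image-patch features, and the match score is the mean cosine similarity between each character embedding and its attended image context. All names, shapes, and the cosine-based score are assumptions, not the authors' design.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_score(img_feats, txt_embs):
    """Hypothetical cross-attention matching score.

    img_feats: (N, d) image-patch features
    txt_embs:  (T, d) character embeddings of the candidate text
    Returns a scalar in [-1, 1]; higher means a better match.
    """
    d = img_feats.shape[-1]
    # text queries attend over image features (scaled dot-product)
    attn = softmax(txt_embs @ img_feats.T / np.sqrt(d), axis=-1)  # (T, N)
    context = attn @ img_feats                                    # (T, d)
    # cosine similarity between each character and its attended context
    cos = np.sum(context * txt_embs, axis=-1) / (
        np.linalg.norm(context, axis=-1)
        * np.linalg.norm(txt_embs, axis=-1) + 1e-8)
    return float(cos.mean())

rng = np.random.default_rng(0)
img = rng.normal(size=(20, 64))   # 20 image-patch features, dim 64
txt = rng.normal(size=(8, 64))    # 8 character embeddings, dim 64
score = cross_attention_score(img, txt)
```

In a trained model the two encoders would be learned jointly and the score thresholded to accept or reject the transcription; here the embeddings are random, so the score only demonstrates the mechanics.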
Notes
- 1. TextMatcher has been deployed at UniCredit, and it is currently used in production.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Arrigoni, V., Repele, L., Saccavino, D.M. (2022). TextMatcher: Cross-Attentional Neural Network to Compare Image and Text. In: Pascal, P., Ienco, D. (eds) Discovery Science. DS 2022. Lecture Notes in Computer Science, vol 13601. Springer, Cham. https://doi.org/10.1007/978-3-031-18840-4_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18839-8
Online ISBN: 978-3-031-18840-4