
Textmatcher: cross-attentional neural network to compare image and text

Published in: Machine Learning

Abstract

We study a multimodal-learning problem where, given an image containing a single line of (printed or handwritten) text and a candidate text transcription, the goal is to assess whether the text represented in the image corresponds to the candidate text. This problem, which we dub text matching, is primarily motivated by a real industrial application scenario of automated cheque processing, whose goal is to automatically assess whether the information on a bank cheque (e.g., the issue date) matches the data entered by the customer while depositing the cheque at an automated teller machine (ATM). The problem also finds more general application in several other scenarios, e.g., personal-identity-document processing in user-registration procedures. We devise a machine-learning model specifically designed for the text-matching problem. The proposed model, termed TextMatcher, compares the two inputs by applying a novel cross-attention mechanism over the embedding representations of image and text, and it is trained in an end-to-end fashion on the desired distribution of errors to be detected. We demonstrate the effectiveness of TextMatcher on the automated-cheque-processing use case, where it is shown to generalize well to future unseen dates, unlike existing models designed for related problems. We further assess the performance of TextMatcher on different distributions of errors on the public IAM dataset. Results attest that TextMatcher achieves higher performance on a variety of configurations than a naïve model, a variant with fully connected layers in place of the cross-attention module, and existing models for related problems.
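To make the abstract's description concrete, the following is a minimal PyTorch sketch of the general idea: character embeddings of the candidate transcription cross-attend over a sequence of visual features extracted from the line image, and a binary head scores the pair. Every module name, layer size, and the character set below are illustrative assumptions, not the authors' published TextMatcher configuration.

    # Minimal sketch of a cross-attention image/text matcher (assumed
    # architecture, not the authors' exact TextMatcher configuration).
    import torch
    import torch.nn as nn

    CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789/- "  # assumed alphabet

    class TextMatcherSketch(nn.Module):
        def __init__(self, d_model=128, n_heads=4):
            super().__init__()
            # Image branch: a small CNN turning a 1 x 32 x W line image into
            # a horizontal sequence of d_model-dimensional visual features.
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, d_model, 3, stride=(8, 1), padding=1), nn.ReLU(),
            )
            # Text branch: character embeddings of the candidate transcription.
            self.char_emb = nn.Embedding(len(CHARSET), d_model)
            # Cross-attention: characters (queries) attend to image positions
            # (keys/values), aligning each character with its visual evidence.
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Binary head: does the image match the candidate text?
            self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

        def forward(self, image, char_ids):
            # image: (B, 1, 32, W); char_ids: (B, L) indices into CHARSET
            feats = self.cnn(image)                    # (B, d, 1, W')
            feats = feats.flatten(2).transpose(1, 2)   # (B, W', d) image sequence
            queries = self.char_emb(char_ids)          # (B, L, d) text sequence
            attended, _ = self.cross_attn(queries, feats, feats)  # (B, L, d)
            return self.head(attended.mean(dim=1)).squeeze(-1)    # match logit

    model = TextMatcherSketch()
    img = torch.randn(2, 1, 32, 256)
    txt = torch.randint(0, len(CHARSET), (2, 12))
    logits = model(img, txt)

Training such a model end-to-end on a chosen error distribution then amounts to labelling genuine image/transcription pairs as positives and synthetically perturbed transcriptions (e.g., digit swaps in a date) as negatives, optimising a binary cross-entropy loss on the logit above.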



Availability of data and materials

We use the publicly available IAM handwriting database (Marti and Bunke 2002) and a proprietary dataset of bank cheques provided by UniCredit.

Code availability

Code is proprietary.

Notes

  1. TextMatcher has been deployed at UniCredit, and it is currently used in production.

  2. https://github.com/ayumiymk/aster.pytorch.

References

  • Aizawa, A. (2003). An information-theoretic perspective of TF-IDF measures. Information Processing & Management, 39(1), 45–65.

  • Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2014). Word spotting and recognition with embedded attributes. IEEE TPAMI, 36(12), 2552–2566.

  • Arrigoni, V., Repele, L., & Saccavino, D. M. (2022). TextMatcher: Cross-attentional neural network to compare image and text. In P. Poncelet & D. Ienco (Eds.), Discovery Science (pp. 347–362). Cham: Springer.

  • Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In ICLR.

  • Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2018). Multimodal machine learning: A survey and taxonomy. IEEE TPAMI, 41(2), 423–443.

  • Chen, X., Jin, L., Zhu, Y., Luo, C., & Wang, T. (2021). Text recognition in the wild: A survey. ACM CSUR, 54(2), Article 42, 1–35.

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS (pp. 249–256).

  • Gómez, L., Rusiñol, M., & Karatzas, D. (2017). LSDE: Levenshtein space deep embedding for query-by-string word spotting. In ICDAR (pp. 499–504).

  • Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., & Schmidhuber, J. (2008). A novel connectionist system for unconstrained handwriting recognition. IEEE TPAMI, 31(5), 855–868.

  • Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In CVPR (pp. 1735–1742).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).

  • Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2016). Reading text in the wild with convolutional neural networks. IJCV, 116, 1–20.

  • Kim, Y., Jernite, Y., Sontag, D., & Rush, A. (2016). Character-aware neural language models. In AAAI (Vol. 30).

  • Lee, K.-H., Chen, X., Hua, G., Hu, H., & He, X. (2018). Stacked cross attention for image-text matching. In ECCV (pp. 212–228).

  • Li, H., & Shen, C. (2016). Reading car license plates using deep convolutional neural networks and LSTMs. arXiv preprint arXiv:1601.05610

  • Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., & Wei, F. (2023). TrOCR: Transformer-based optical character recognition with pre-trained models. In AAAI (Vol. 37, pp. 13094–13102).

  • Marti, U.-V., & Bunke, H. (2002). The IAM-database: An English sentence database for offline handwriting recognition. IJDAR, 5(1), 39–46.

  • Mhiri, M., Desrosiers, C., & Cheriet, M. (2019). Word spotting and recognition via a joint deep embedding of image and text. Pattern Recognition, 88, 312–320.

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781

  • Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In EMNLP (pp. 1532–1543).

  • Qi, X., Zhang, Y., Qi, J., & Lu, H. (2021). Self-attention guided representation learning for image-text matching. Neurocomputing, 450, 143–155.

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In ICML (pp. 8748–8763).

  • Rath, T. M., & Manmatha, R. (2007). Word spotting for historical documents. IJDAR, 9(2), 139–152.

  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In CVPR (pp. 779–788).

  • Shi, B., Wang, X., Lyu, P., Yao, C., & Bai, X. (2016). Robust scene text recognition with automatic rectification. In CVPR (pp. 4168–4176).

  • Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., & Bai, X. (2018). ASTER: An attentional scene text recognizer with flexible rectification. IEEE TPAMI, 41(9), 2035–2048.

  • Sudholt, S., & Fink, G. A. (2016). PHOCNet: A deep convolutional neural network for word spotting in handwritten documents. In ICFHR (pp. 277–282).

  • Sudholt, S., & Fink, G. A. (2018). Attribute CNNs for word spotting in handwritten documents. IJDAR, 21(3), 199–218.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In NIPS (pp. 5998–6008).

  • Voigtlaender, P., Doetsch, P., & Ney, H. (2016). Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. In ICFHR (pp. 228–233).

  • Wang, Y., Yang, H., Qian, X., Ma, L., Lu, J., Li, B., & Fan, X. (2019). Position focused attention network for image-text matching. In IJCAI (pp. 3792–3798).

  • Wilkinson, T., & Brun, A. (2016). Semantic and verbatim word spotting using deep neural networks. In ICFHR (pp. 307–312).

  • Yousef, M., Hussain, K. F., & Mohammed, U. S. (2020). Accurate, data-efficient, unconstrained text recognition with convolutional neural networks. Pattern Recognition, 108, 107482.

  • Zhang, Y., Jin, R., & Zhou, Z.-H. (2010). Understanding bag-of-words model: A statistical framework. International Journal of Machine Learning and Cybernetics, 1, 43–52.


Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

VA designed the model and the experimental framework for the industrial use case. All authors contributed to the experiments for the general applicability of the model. All authors discussed the results and contributed to the final manuscript.

Corresponding authors

Correspondence to Valentina Arrigoni, Luisa Repele or Dario Marino Saccavino.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest or competing interests.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editors: Dino Ienco, Roberto Interdonato, Pascal Poncelet.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Arrigoni, V., Repele, L. & Saccavino, D.M. Textmatcher: cross-attentional neural network to compare image and text. Mach Learn 113, 2045–2066 (2024). https://doi.org/10.1007/s10994-023-06418-6

