
Textmatcher: cross-attentional neural network to compare image and text

Published in: Machine Learning

Abstract

We study a multimodal-learning problem where, given an image containing a single line of (printed or handwritten) text and a candidate text transcription, the goal is to assess whether the text represented in the image corresponds to the candidate text. This problem, which we dub text matching, is primarily motivated by a real industrial application scenario of automated cheque processing, whose goal is to automatically assess whether the information on a bank cheque (e.g., the issue date) matches the data entered by the customer while depositing the cheque at an automated teller machine (ATM). The problem also finds more general application in several other scenarios, e.g., personal-identity-document processing in user-registration procedures. We devise a machine-learning model specifically designed for the text-matching problem. The proposed model, termed TextMatcher, compares the two inputs by applying a novel cross-attention mechanism over the embedding representations of image and text, and it is trained in an end-to-end fashion on the desired distribution of errors to be detected. We demonstrate the effectiveness of TextMatcher on the automated-cheque-processing use case, where it is shown to generalize well to future unseen dates, unlike existing models designed for related problems. We further assess the performance of TextMatcher on different distributions of errors on the public IAM dataset. Results attest that TextMatcher achieves higher performance on a variety of configurations than a naïve model, a variant with fully connected layers in place of the cross-attention module, and existing models for related problems.
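To make the abstract's description concrete, the following is a minimal PyTorch sketch of the general idea: character embeddings of the candidate transcription cross-attend over a sequence of visual features extracted from the line image, and a binary head scores the pair. Every module name, layer size, and the character set below are illustrative assumptions, not the authors' published TextMatcher configuration.

    # Minimal sketch of a cross-attention image/text matcher (assumed
    # architecture, not the authors' exact TextMatcher configuration).
    import torch
    import torch.nn as nn

    CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789/- "  # assumed alphabet

    class TextMatcherSketch(nn.Module):
        def __init__(self, d_model=128, n_heads=4):
            super().__init__()
            # Image branch: a small CNN turning a 1 x 32 x W line image into
            # a horizontal sequence of d_model-dimensional visual features.
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, d_model, 3, stride=(8, 1), padding=1), nn.ReLU(),
            )
            # Text branch: character embeddings of the candidate transcription.
            self.char_emb = nn.Embedding(len(CHARSET), d_model)
            # Cross-attention: characters (queries) attend to image positions
            # (keys/values), aligning each character with its visual evidence.
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Binary head: does the image match the candidate text?
            self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

        def forward(self, image, char_ids):
            # image: (B, 1, 32, W); char_ids: (B, L) indices into CHARSET
            feats = self.cnn(image)                    # (B, d, 1, W')
            feats = feats.flatten(2).transpose(1, 2)   # (B, W', d) image sequence
            queries = self.char_emb(char_ids)          # (B, L, d) text sequence
            attended, _ = self.cross_attn(queries, feats, feats)  # (B, L, d)
            return self.head(attended.mean(dim=1)).squeeze(-1)    # match logit

    model = TextMatcherSketch()
    img = torch.randn(2, 1, 32, 256)
    txt = torch.randint(0, len(CHARSET), (2, 12))
    logits = model(img, txt)

Training such a model end-to-end on a chosen error distribution then amounts to labelling genuine image/transcription pairs as positives and synthetically perturbed transcriptions (e.g., digit swaps in a date) as negatives, optimising a binary cross-entropy loss on the logit above.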



Availability of data and materials

We use the publicly available IAM handwriting database (Marti and Bunke 2002) and a proprietary dataset of bank cheques provided by UniCredit.

Code availability

Code is proprietary.

Notes

  1. TextMatcher has been deployed at UniCredit, and it is currently used in production.

  2. https://github.com/ayumiymk/aster.pytorch.

References

  • Aizawa, A. (2003). An information-theoretic perspective of TF-IDF measures. Information Processing & Management, 39(1), 45–65.

  • Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2014). Word spotting and recognition with embedded attributes. IEEE TPAMI, 36(12), 2552–2566.

  • Arrigoni, V., Repele, L., & Saccavino, D. M. (2022). TextMatcher: Cross-attentional neural network to compare image and text. In P. Poncelet & D. Ienco (Eds.), Discovery Science (pp. 347–362). Cham: Springer.

  • Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In ICLR.

  • Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2018). Multimodal machine learning: A survey and taxonomy. IEEE TPAMI, 41(2), 423–443.

  • Chen, X., Jin, L., Zhu, Y., Luo, C., & Wang, T. (2021). Text recognition in the wild: A survey. ACM CSUR, 54(2), Article 42, 1–35.

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS (pp. 249–256).

  • Gómez, L., Rusiñol, M., & Karatzas, D. (2017). LSDE: Levenshtein space deep embedding for query-by-string word spotting. In ICDAR (pp. 499–504).

  • Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., & Schmidhuber, J. (2008). A novel connectionist system for unconstrained handwriting recognition. IEEE TPAMI, 31(5), 855–868.

  • Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In CVPR (pp. 1735–1742).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).

  • Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2016). Reading text in the wild with convolutional neural networks. IJCV, 116, 1–20.

  • Kim, Y., Jernite, Y., Sontag, D., & Rush, A. (2016). Character-aware neural language models. In AAAI (Vol. 30).

  • Lee, K.-H., Chen, X., Hua, G., Hu, H., & He, X. (2018). Stacked cross attention for image-text matching. In ECCV (pp. 212–228).

  • Li, H., & Shen, C. (2016). Reading car license plates using deep convolutional neural networks and LSTMs. arXiv preprint arXiv:1601.05610

  • Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., & Wei, F. (2023). TrOCR: Transformer-based optical character recognition with pre-trained models. In AAAI (Vol. 37, pp. 13094–13102).

  • Marti, U.-V., & Bunke, H. (2002). The IAM-database: An English sentence database for offline handwriting recognition. IJDAR, 5(1), 39–46.

  • Mhiri, M., Desrosiers, C., & Cheriet, M. (2019). Word spotting and recognition via a joint deep embedding of image and text. Pattern Recognition, 88, 312–320.

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781

  • Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In EMNLP (pp. 1532–1543).

  • Qi, X., Zhang, Y., Qi, J., & Lu, H. (2021). Self-attention guided representation learning for image-text matching. Neurocomputing, 450, 143–155.

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In ICML (pp. 8748–8763).

  • Rath, T. M., & Manmatha, R. (2007). Word spotting for historical documents. IJDAR, 9(2), 139–152.

  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In CVPR (pp. 779–788).

  • Shi, B., Wang, X., Lyu, P., Yao, C., & Bai, X. (2016). Robust scene text recognition with automatic rectification. In CVPR (pp. 4168–4176).

  • Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., & Bai, X. (2018). ASTER: An attentional scene text recognizer with flexible rectification. IEEE TPAMI, 41(9), 2035–2048.

  • Sudholt, S., & Fink, G. A. (2016). PHOCNet: A deep convolutional neural network for word spotting in handwritten documents. In ICFHR (pp. 277–282).

  • Sudholt, S., & Fink, G. A. (2018). Attribute CNNs for word spotting in handwritten documents. IJDAR, 21(3), 199–218.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In NIPS (pp. 5998–6008).

  • Voigtlaender, P., Doetsch, P., & Ney, H. (2016). Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. In ICFHR (pp. 228–233).

  • Wang, Y., Yang, H., Qian, X., Ma, L., Lu, J., Li, B., & Fan, X. (2019). Position focused attention network for image-text matching. In IJCAI (pp. 3792–3798).

  • Wilkinson, T., & Brun, A. (2016). Semantic and verbatim word spotting using deep neural networks. In ICFHR (pp. 307–312).

  • Yousef, M., Hussain, K. F., & Mohammed, U. S. (2020). Accurate, data-efficient, unconstrained text recognition with convolutional neural networks. Pattern Recognition, 108, 107482.

  • Zhang, Y., Jin, R., & Zhou, Z.-H. (2010). Understanding bag-of-words model: A statistical framework. International Journal of Machine Learning and Cybernetics, 1, 43–52.


Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

VA designed the model and the experimental framework for the industrial use case. All authors contributed to the experiments for the general applicability of the model. All authors discussed the results and contributed to the final manuscript.

Corresponding authors

Correspondence to Valentina Arrigoni, Luisa Repele or Dario Marino Saccavino.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest or competing interests.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editors: Dino Ienco, Roberto Interdonato, Pascal Poncelet.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Arrigoni, V., Repele, L. & Saccavino, D.M. Textmatcher: cross-attentional neural network to compare image and text. Mach Learn 113, 2045–2066 (2024). https://doi.org/10.1007/s10994-023-06418-6

