
Refocus attention span networks for handwriting line recognition

  • Original Paper
  • Published in: International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Recurrent neural networks have achieved outstanding performance in handwriting recognition despite the enormous variability observed across diverse handwriting styles and poor-quality scanned documents. We initially propose a BiLSTM baseline model, a sequential architecture well suited to modeling text lines owing to its ability to learn probability distributions over character or word sequences. However, such recurrent paradigms prevent parallelization and suffer from vanishing gradients on long sequences during training. To alleviate these limitations, we make four significant contributions in this work. First, we devise an end-to-end model composed of a split-attention CNN backbone, which serves as the feature extractor, and a self-attention Transformer encoder–decoder, which serves as the transcriber that recognizes handwritten manuscripts. The multi-head self-attention layers of the Transformer-based encoder–decoder enhance the model's ability to tackle handwriting recognition and to learn the linguistic dependencies of character sequences. Second, we conduct various studies on transfer learning (TL) from large datasets to a small database, determining which model layers require fine-tuning. Third, we obtain an efficient paradigm by combining different TL strategies with data augmentation (DA). Finally, because the proposed model is lexicon-free and can recognize sentences not present in the training data, it can be trained on only a few labeled examples, with no extra cost of generating and training on synthetic datasets. The resulting character and word error rates (CER/WER) on four benchmark datasets are comparable to, and in some cases better than, those of the most recent state-of-the-art (SOTA) models.
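To make the pipeline concrete, the following is a minimal PyTorch sketch of the kind of architecture the abstract describes: a convolutional backbone that reduces a text-line image to a horizontal feature sequence, followed by a Transformer encoder–decoder that transcribes characters autoregressively. This is an illustration, not the authors' implementation: the plain convolutional stack stands in for the paper's split-attention (ResNeSt-style) backbone, and every layer size, the vocabulary size, and the depth settings are placeholder assumptions.

```python
# Illustrative sketch only (not the authors' released code): CNN backbone ->
# feature sequence -> Transformer encoder-decoder -> character logits.
import math
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding (batch-first)."""
    def __init__(self, d_model: int, max_len: int = 1024):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                       # x: (batch, seq, d_model)
        return x + self.pe[: x.size(1)]


class HTRTransformer(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        # Backbone: collapses image height, keeps width as the time axis.
        # A plain conv stack stands in for the split-attention backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d_model, 3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),     # -> (B, d_model, 1, W')
        )
        self.pos = PositionalEncoding(d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        # images: (B, 1, H, W); tgt_tokens: (B, T) shifted-right characters.
        feats = self.backbone(images).squeeze(2).permute(0, 2, 1)  # (B, W', d)
        tgt = self.pos(self.embed(tgt_tokens))
        causal = self.transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        out = self.transformer(self.pos(feats), tgt, tgt_mask=causal)
        return self.head(out)                    # (B, T, vocab) character logits


model = HTRTransformer(vocab_size=100)
logits = model(torch.randn(2, 1, 64, 256), torch.randint(0, 100, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 100])
```

At training time the logits would be scored against the left-shifted target characters with cross-entropy; at inference, decoding proceeds character by character from a start token, which is why no lexicon is needed.

The reported metrics follow the standard definitions: CER is the Levenshtein edit distance between the predicted and reference transcriptions divided by the reference length in characters, and WER applies the same computation over words. A minimal sketch of that convention (again, not the authors' evaluation script):

```python
# Standard CER/WER: edit distance over characters or words, normalized by
# the reference length.
def levenshtein(a, b) -> int:
    """Edit distance between two sequences via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(pred: str, ref: str) -> float:
    return levenshtein(pred, ref) / max(len(ref), 1)

def wer(pred: str, ref: str) -> float:
    return levenshtein(pred.split(), ref.split()) / max(len(ref.split()), 1)

print(cer("hand writing", "handwriting"))  # 1 edit / 11 reference chars ~ 0.091
print(wer("hand writing", "handwriting"))  # 2 word edits / 1 reference word = 2.0
```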

Acknowledgements

The authors would like to thank NSERC Canada for its financial support under Grant #2019-05230.

Author information

Corresponding author

Correspondence to Mohammed Hamdan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hamdan, M., Chaudhary, H., Bali, A. et al. Refocus attention span networks for handwriting line recognition. IJDAR 26, 131–147 (2023). https://doi.org/10.1007/s10032-022-00422-7
