
HWNet v2: an efficient word image representation for handwritten documents

  • Original Paper
  • Published in: International Journal on Document Analysis and Recognition (IJDAR)

Abstract

We present a framework for learning an efficient holistic representation for handwritten word images. The proposed method uses a deep convolutional neural network with a traditional classification loss. The major strengths of our work lie in: (i) the efficient use of synthetic data to pre-train a deep network, (ii) an adapted version of the ResNet-34 architecture with region-of-interest pooling (referred to as HWNet v2), which learns discriminative features for variable-sized word images, and (iii) a realistic augmentation of training data with multiple scales and distortions which mimics the natural process of handwriting. We further investigate transfer learning to reduce the domain gap between the synthetic and real domains, and analyze the invariances learned at different layers of the network using visualization techniques proposed in the literature. Our representation leads to state-of-the-art word spotting performance on standard handwritten datasets and historical manuscripts in different languages with minimal representation size. On the challenging IAM dataset, our method is the first to report an mAP of around 0.90 for word spotting with a representation size of just 32 dimensions. Furthermore, we present results on printed document datasets in English and Indic scripts which validate the generic nature of the proposed framework for learning word image representations.
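Concretely, the word-spotting evaluation the abstract refers to can be sketched as query-by-example retrieval over fixed-size embeddings: rank every other word image by cosine similarity to the query's embedding, then score the ranking with mean average precision (mAP). The sketch below is illustrative only; the function names and toy 32-dimensional data are our assumptions, not the paper's actual pipeline or code.

```python
# Illustrative sketch of query-by-example word spotting with fixed-size
# word-image embeddings; names and toy data are ours, not the paper's code.
import numpy as np

def rank_by_cosine(query, gallery):
    """Indices of gallery rows sorted by descending cosine similarity."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))

def average_precision(ranked_labels, query_label):
    """AP for one query: mean precision@k taken at each relevant rank."""
    hits, precisions = 0, []
    for k, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(embeddings, labels):
    """mAP over all queries, excluding each query from its own gallery."""
    aps = []
    for i in range(len(labels)):
        idx = np.array([j for j in range(len(labels)) if j != i])
        order = rank_by_cosine(embeddings[i], embeddings[idx])
        ranked = [labels[j] for j in idx[order]]
        aps.append(average_precision(ranked, labels[i]))
    return float(np.mean(aps))
```

With a toy gallery in which embeddings of the same transcription coincide, `mean_average_precision` returns 1.0; on the real IAM benchmark the abstract reports roughly 0.90 with 32-dimensional embeddings.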


Notes

  1. We use ImageMagick for rendering the word images. URL: http://www.imagemagick.org/script/index.php.

  2. http://cvit.iiit.ac.in/research/projects/cvit-projects/hwnet.


Acknowledgements

This work is partly supported by IMPRINT. Praveen Krishnan is supported by the Amazon Alexa Graduate Fellowship.

Author information


Corresponding author

Correspondence to Praveen Krishnan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Krishnan, P., Jawahar, C.V. HWNet v2: an efficient word image representation for handwritten documents. IJDAR 22, 387–405 (2019). https://doi.org/10.1007/s10032-019-00336-x

