Abstract
We present a framework for learning an efficient holistic representation for handwritten word images. The proposed method uses a deep convolutional neural network with a traditional classification loss. The major strengths of our work lie in: (i) the efficient usage of synthetic data to pre-train a deep network, (ii) an adapted version of the ResNet-34 architecture with region-of-interest pooling (referred to as HWNet v2), which learns discriminative features for variable-sized word images, and (iii) a realistic augmentation of training data with multiple scales and distortions, which mimics the natural process of handwriting. We further investigate transfer learning to reduce the gap between the synthetic and real domains, and analyze the invariances learned at different layers of the network using visualization techniques proposed in the literature. Our representation leads to state-of-the-art word spotting performance on standard handwritten datasets and historical manuscripts in different languages with minimal representation size. On the challenging IAM dataset, our method is the first to report an mAP of around 0.90 for word spotting with a representation size of just 32 dimensions. Furthermore, we present results on printed document datasets in English and Indic scripts, which validate the generic nature of the proposed framework for learning word image representations.
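To make the reported metric concrete, the following is a minimal sketch of how mAP can be computed for query-by-example word spotting over compact descriptors. It is an illustration under stated assumptions, not the evaluation protocol used in the paper: the function names, the use of cosine similarity, the rule of skipping queries that have no other instance of the same word, and the toy 2-dimensional descriptors are all hypothetical choices made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_precision(ranked_relevance):
    """AP for one query, given a ranked list of booleans (True = relevant)."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(descriptors, labels):
    """Query-by-example mAP: every word image queries all the others.

    Assumption for this sketch: queries with no other instance of the
    same word in the collection are skipped.
    """
    ap_values = []
    for q, (q_vec, q_label) in enumerate(zip(descriptors, labels)):
        candidates = [i for i in range(len(descriptors)) if i != q]
        if not any(labels[i] == q_label for i in candidates):
            continue
        # Rank the remaining word images by similarity to the query.
        ranked = sorted(
            candidates,
            key=lambda i: cosine_similarity(q_vec, descriptors[i]),
            reverse=True,
        )
        ap_values.append(average_precision([labels[i] == q_label for i in ranked]))
    return sum(ap_values) / len(ap_values) if ap_values else 0.0


# Toy data: 2-dimensional stand-ins for learned word image descriptors.
toy_descriptors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["the", "the", "and", "and"]
toy_map = mean_average_precision(toy_descriptors, toy_labels)
```

With a real system, each descriptor would be the network's output for one word image (32 values in the setting described above) and each label its transcription. Retrieval cost scales with descriptor length, which is why a small representation size matters for large collections.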
Notes
We use ImageMagick for rendering the word images. URL: http://www.imagemagick.org/script/index.php.
Acknowledgements
This work is partly supported by IMPRINT. Praveen Krishnan is supported by the Amazon Alexa Graduate Fellowship.
About this article
Cite this article
Krishnan, P., Jawahar, C.V. HWNet v2: an efficient word image representation for handwritten documents. IJDAR 22, 387–405 (2019). https://doi.org/10.1007/s10032-019-00336-x
DOI: https://doi.org/10.1007/s10032-019-00336-x