Word spotting for historical documents

Rath, Tony M.; Manmatha, R.

doi:10.1007/s10032-006-0027-8

An Erratum to this article was published on 14 December 2006

Abstract

Searching and indexing historical handwritten collections is a very challenging problem. We describe an approach called word spotting, which groups word images into clusters of similar words, using image matching to measure similarity. By annotating “interesting” clusters, an index that links words to the locations where they occur can be built automatically. Image similarities computed with a number of different techniques, including dynamic time warping, are compared. The word similarities are then used for clustering with both K-means and agglomerative clustering techniques. It is shown on a subset of the George Washington collection that such a word spotting technique can outperform a Hidden Markov Model word-based recognition technique in terms of word error rate.
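The pipeline outlined in the abstract (match word images, cluster them, annotate clusters, build an index) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: each word image is reduced to a single 1-D profile of numbers, the local cost is a squared difference, the DTW cost is normalised by the summed sequence lengths, clustering is single-linkage agglomerative with a hand-picked threshold, and the names `dtw_distance`, `agglomerative_clusters` and `build_index` are hypothetical.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D profiles.

    Assumption: squared difference as local cost, and the total cost
    divided by len(a) + len(b) so that long words are not penalised.
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m] / (n + m)


def agglomerative_clusters(profiles, threshold):
    """Single-linkage agglomerative clustering on pairwise DTW costs.

    Repeatedly merges the two closest clusters and stops once the
    closest pair is farther apart than `threshold` (an assumed cut-off).
    """
    count = len(profiles)
    dist = {
        (i, j): dtw_distance(profiles[i], profiles[j])
        for i in range(count)
        for j in range(i + 1, count)
    }

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist[min(i, j), max(i, j)] for i in c1 for j in c2)

    clusters = [[i] for i in range(count)]
    while len(clusters) > 1:
        d, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return clusters


def build_index(clusters, labels, locations):
    """Map each annotated cluster's word to the locations of its images."""
    return {
        word: [locations[i] for i in clusters[cluster_id]]
        for cluster_id, word in labels.items()
    }


# Toy data: two profile shapes, each occurring twice with different lengths.
profiles = [[0, 1, 2, 1, 0], [0, 1, 1, 2, 1, 0], [5, 5, 5, 5], [5, 5, 5, 5, 5]]
locations = ["page1:line3", "page2:line7", "page1:line9", "page4:line1"]
clusters = agglomerative_clusters(profiles, threshold=1.0)
index = build_index(clusters, {0: "word_a"}, locations)
```

In this toy run only the first cluster is annotated, which mirrors the idea of labelling just the "interesting" clusters: one manual label yields index entries for every occurrence grouped into that cluster.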

Author information

Authors and Affiliations

Multimedia Indexing and Retrieval Group, Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts Amherst, Amherst, MA, 01003, USA
Tony M. Rath & R. Manmatha

Corresponding author

Correspondence to Tony M. Rath.

Additional information

An erratum to this article can be found at http://dx.doi.org/10.1007/s10032-006-0035-8

About this article

Cite this article

Rath, T.M., Manmatha, R. Word spotting for historical documents. IJDAR 9, 139–152 (2007). https://doi.org/10.1007/s10032-006-0027-8

Received: 14 February 2005
Revised: 07 November 2005
Accepted: 28 May 2006
Published: 26 August 2006
Issue Date: April 2007
DOI: https://doi.org/10.1007/s10032-006-0027-8