Scale Space Technique for Word Segmentation in Handwritten Documents
Abstract
Indexing large archives of historical manuscripts, like the papers of George Washington, is required to allow rapid perusal by scholars and researchers who wish to consult the original manuscripts. Presently, such large archives are indexed manually. Since optical character recognition (OCR) works poorly with handwriting, a scheme based on matching word images called word spotting has been suggested previously for indexing such documents. The important steps in this scheme are segmentation of a document page into words and creation of lists containing instances of the same word by word image matching.
We have developed a novel methodology for segmenting handwritten document images by analyzing the extent of “blobs” in a scale space representation of the image. We believe this is the first application of scale space to this problem. The algorithm has been applied to around 30 grey-level images randomly picked from different sections of the George Washington corpus of 6,400 handwritten document images. Accuracy ranged from 77 to 96 percent, with an average of around 87 percent. The algorithm works well in the presence of noise, shine-through, and other artifacts that may arise from aging and degradation of the page over a couple of centuries, or from the man-made processes of photocopying and scanning.
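The idea of blobs whose extent depends on scale can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not specify the filter, the scale selection rule, or any thresholds, so this example assumes plain separable Gaussian smoothing of a binary ink image followed by a fixed threshold and 4-connected component labelling. All names (`gaussian_kernel`, `smooth`, `blob_boxes`, `make_toy_line`) and all parameter values are hypothetical and chosen only for illustration.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel, truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma))
               for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_rows(image, sigma):
    """Convolve each row with a Gaussian; pixels outside the image count as 0."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = x + k - radius
                if 0 <= xx < width:
                    acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def smooth(image, sigma_x, sigma_y):
    """Separable Gaussian smoothing with independent x and y scales."""
    horizontal = smooth_rows(image, sigma_x)
    return transpose(smooth_rows(transpose(horizontal), sigma_y))

def blob_boxes(image, sigma_x, sigma_y, threshold=0.3):
    """Smooth an ink image (1 = ink), threshold it, and return the bounding
    boxes (x0, y0, x1, y1) of the 4-connected regions above the threshold.
    The fixed threshold is an assumption of this sketch."""
    smoothed = smooth(image, sigma_x, sigma_y)
    h, w = len(image), len(image[0])
    mask = [[smoothed[y][x] > threshold for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Flood fill one blob and track its extent.
            stack = [(y, x)]
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while stack:
                cy, cx = stack.pop()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return sorted(boxes)

def make_toy_line():
    """Synthetic text line: two 'words' of three 2-pixel-wide strokes each,
    with 1-pixel gaps inside a word and a 6-pixel gap between the words."""
    ink_columns = {2, 3, 5, 6, 8, 9, 16, 17, 19, 20, 22, 23}
    width, height = 28, 5
    return [[1 if (1 <= y <= 3 and x in ink_columns) else 0
             for x in range(width)] for y in range(height)]

if __name__ == "__main__":
    line = make_toy_line()
    for sigma_x in (0.3, 2.0):
        boxes = blob_boxes(line, sigma_x=sigma_x, sigma_y=0.5)
        print(sigma_x, len(boxes), boxes)
```

On this toy line, a fine horizontal scale should leave the individual strokes as separate blobs, while a coarser scale should merge the strokes of each word and still keep the two words apart. A real page would need grey-level input, line segmentation, and a principled way to choose the scale, none of which this sketch attempts.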
Keywords
Scale Space, Document Image, Line Image, Word Segmentation, Word Image