Abstract
With advances in the digitization of printed documents and several mass digitization projects underway, information retrieval and document search have emerged as key research areas. However, most current work in these areas is limited to English and a few oriental languages. The lack of efficient solutions for Indic scripts has hampered information extraction from a large body of documents of cultural and historical importance. This chapter presents two topics in this area. First, we describe a script-specific keyword spotting method for Devanagari documents that exploits domain knowledge of the script. Second, we address the need of a digital library to provide access to a collection of documents in multiple scripts, which requires intelligent solutions that scale across scripts; we present a script-independent keyword spotting approach for this purpose. Experimental results illustrate the efficacy of both methods.
Acknowledgment
This material is based upon work supported by the National Science Foundation under grants IIS-0112059, IIS-0535038, and IIS-0849511.
Copyright information
© 2009 Springer-Verlag London Limited
About this chapter
Cite this chapter
Bhardwaj, A., Setlur, S., Govindaraju, V. (2009). Word Spotting for Indic Documents to Facilitate Retrieval. In: Govindaraju, V., Setlur, S. (eds) Guide to OCR for Indic Scripts. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84800-330-9_15
DOI: https://doi.org/10.1007/978-1-84800-330-9_15
Publisher Name: Springer, London
Print ISBN: 978-1-84800-329-3
Online ISBN: 978-1-84800-330-9
eBook Packages: Computer Science, Computer Science (R0)