Word Spotting for Indic Documents to Facilitate Retrieval
With advances in the digitization of printed documents and several mass digitization projects underway, information retrieval and document search have emerged as key research areas. However, most current work in these areas is limited to English and a few oriental languages. The lack of efficient solutions for Indic scripts has hampered information extraction from a large body of documents of cultural and historical importance. This chapter addresses two topics in this area. First, we describe a script-specific keyword spotting method for Devanagari documents that makes use of domain knowledge of the script. Second, we address the needs of a digital library that must provide access to a collection of documents in multiple scripts. This requires solutions that scale across different scripts, and we present a script-independent keyword spotting approach for this purpose. Experimental results illustrate the efficacy of our methods.
Keywords: Document analysis, Keyword spotting, Optical character recognition, Document retrieval, Indic scripts
This material is based upon work supported by the National Science Foundation under grant nos. IIS-0112059, IIS-0535038, and IIS-0849511.