Enabling Search over Large Collections of Telugu Document Images – An Automatic Annotation Based Approach

Pramod Sankar, K.; Jawahar, C. V.

doi:10.1007/11949619_75

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4338)

Abstract

For the first time, search is enabled over a massive collection of 21 million word images from digitized document images. This work advances the state of the art on multiple fronts: i) Indian language document images are made searchable by textual queries; ii) interactive content-level access is provided to document images for search and retrieval; iii) a novel recognition-free approach, which does not require OCR, is adapted and validated; iv) a suite of image processing and pattern classification algorithms is proposed to efficiently automate the process; and v) the scalability of the solution is demonstrated over a large collection of 500 digitized books consisting of 75,000 pages.

Character-recognition-based approaches yield poor results when used to build search engines for Indian language document images, owing to the complexity of the script and the poor quality of the documents. Recognition-free approaches, based on word spotting, do not scale directly to large collections because of the computational cost of matching images in the feature space. For example, if matching two images takes 1 ms, retrieving documents for a single query from a collection as large as ours would take close to a day. In this paper we propose a novel automatic-annotation-based approach that provides a textual description of document images. With a one-time, offline computational effort, we are able to build a text-based retrieval system over the annotated images. This system has an interactive response time of about 0.01 seconds. However, we pay the price in the form of massive offline computation, which is performed on a cluster of 35 computers for about a month. Our procedure is highly automated, requiring minimal human intervention.
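The query-time side of this idea can be illustrated with a minimal sketch. It assumes the offline stage has already assigned a text label to every word image; the abstract does not describe how the system's index is built, so the inverted index, the names `build_index` and `search`, and the sample annotations below are all hypothetical illustrations rather than the authors' implementation.

```python
from collections import defaultdict


def build_index(annotations):
    """Build an inverted index from annotated word images.

    `annotations` maps a word-image id to (label, location), where
    `label` is the text assigned offline and `location` is a
    (book, page) pair. Both the layout and the names are assumptions.
    """
    index = defaultdict(set)
    for _image_id, (label, location) in annotations.items():
        index[label].add(location)
    return index


def search(index, query):
    """Return the pages that contain every term of a textual query."""
    hits = [index.get(term, set()) for term in query.split()]
    return set.intersection(*hits) if hits else set()


# Hypothetical annotations; labels are transliterated for readability.
annotations = {
    "img-0001": ("telugu", ("book-1", 12)),
    "img-0002": ("bhasha", ("book-1", 12)),
    "img-0003": ("telugu", ("book-2", 40)),
}
index = build_index(annotations)
print(search(index, "telugu bhasha"))
```

Because a query touches only the index entries for its own terms, no image-to-image matching is needed at search time; all matching cost is paid in the offline annotation stage.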

Author information

Authors and Affiliations

Centre for Visual Information Technology, International Institute of Information Technology, Hyderabad, India
K. Pramod Sankar & C. V. Jawahar

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, IIT Delhi, New Delhi, India
Prem K. Kalra
School of Computer Science and Engineering, The Hebrew University of Jerusalem, 91904, Jerusalem, Israel
Shmuel Peleg

Cite this paper

Pramod Sankar, K., Jawahar, C.V. (2006). Enabling Search over Large Collections of Telugu Document Images – An Automatic Annotation Based Approach. In: Kalra, P.K., Peleg, S. (eds) Computer Vision, Graphics and Image Processing. Lecture Notes in Computer Science, vol 4338. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11949619_75

DOI: https://doi.org/10.1007/11949619_75
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68301-8
Online ISBN: 978-3-540-68302-5
eBook Packages: Computer Science, Computer Science (R0)