Enabling Search over Large Collections of Telugu Document Images – An Automatic Annotation Based Approach

  • Conference paper
Computer Vision, Graphics and Image Processing

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 4338)

Abstract

For the first time, search is enabled over a massive collection of 21 million word images from digitized document images. This work advances the state of the art on multiple fronts: i) Indian language document images are made searchable by textual queries, ii) interactive content-level access is provided to document images for search and retrieval, iii) a novel recognition-free approach, which does not require OCR, is adapted and validated, iv) a suite of image processing and pattern classification algorithms is proposed to efficiently automate the process, and v) the scalability of the solution is demonstrated over a large collection of 500 digitized books consisting of 75,000 pages.

Character recognition based approaches yield poor results when developing search engines for Indian language document images, due to the complexity of the script and the poor quality of the documents. Recognition-free approaches, based on word spotting, are not directly scalable to large collections, due to the computational complexity of matching images in the feature space. For example, if it takes 1 ms to match two images, retrieving documents for a single query from a large collection like ours would require close to a day. In this paper, we propose a novel automatic annotation based approach that provides a textual description of document images. With a one-time, offline computational effort, we are able to build a text-based retrieval system over the annotated images. This system has an interactive response time of about 0.01 seconds. However, we pay the price in the form of massive offline computation, which is performed on a cluster of 35 computers for about a month. Our procedure is highly automatic, requiring minimal human intervention.
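
The trade-off described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy annotations and the dictionary-based index are assumptions made purely for illustration. The first function computes the cost of a single linear pass of image matching at the 1 ms rate quoted above (roughly 5.8 hours for 21 million word images; the abstract's larger estimate is not broken down in this text, so the sketch does not try to reproduce it). The remaining functions show why a text index built once offline can answer queries without any image matching at query time.

```python
from collections import defaultdict

WORD_IMAGES = 21_000_000   # collection size stated in the abstract
MATCH_SECONDS = 1e-3       # 1 ms per image-to-image match (the abstract's example rate)


def linear_scan_hours(n_images: int, seconds_per_match: float) -> float:
    """Hours needed to match one query image against every word image once."""
    return n_images * seconds_per_match / 3600.0


def build_index(annotations):
    """Offline step (hypothetical): map each text label to the ids of the
    word images that were annotated with it."""
    index = defaultdict(list)
    for image_id, label in annotations:
        index[label].append(image_id)
    return index


def search(index, query):
    """Online step: a plain dictionary lookup, with no image matching involved."""
    return index.get(query, [])


if __name__ == "__main__":
    hours = linear_scan_hours(WORD_IMAGES, MATCH_SECONDS)
    print(f"one linear pass of matching: {hours:.1f} hours")

    # Toy annotations (image id, text label); real labels would be Telugu words.
    toy_annotations = [(0, "andhra"), (1, "telugu"), (2, "andhra")]
    toy_index = build_index(toy_annotations)
    print(search(toy_index, "andhra"))
```

In this sketch the expensive work sits entirely in producing the annotations; once they exist, query time depends on the index lookup rather than on the number of word images.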

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pramod Sankar, K., Jawahar, C.V. (2006). Enabling Search over Large Collections of Telugu Document Images – An Automatic Annotation Based Approach. In: Kalra, P.K., Peleg, S. (eds) Computer Vision, Graphics and Image Processing. Lecture Notes in Computer Science, vol 4338. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11949619_75

  • DOI: https://doi.org/10.1007/11949619_75

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68301-8

  • Online ISBN: 978-3-540-68302-5

  • eBook Packages: Computer Science, Computer Science (R0)