Enabling Search over Large Collections of Telugu Document Images – An Automatic Annotation Based Approach

  • K. Pramod Sankar
  • C. V. Jawahar
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4338)


For the first time, search is enabled over a massive collection of 21 million word images from digitized document images. This work advances the state of the art on multiple fronts: (i) Indian language document images are made searchable by textual queries; (ii) interactive content-level access is provided to document images for search and retrieval; (iii) a novel recognition-free approach, which does not require OCR, is adapted and validated; (iv) a suite of image processing and pattern classification algorithms is proposed to efficiently automate the process; and (v) the scalability of the solution is demonstrated over a large collection of 500 digitized books consisting of 75,000 pages.

Character-recognition-based approaches yield poor results when used to build search engines for Indian language document images, owing to the complexity of the script and the poor quality of the documents. Recognition-free approaches based on word spotting do not scale directly to large collections, because matching images in the feature space is computationally expensive. For example, if matching two images takes 1 ms, retrieving documents for a single query from a collection as large as ours would require close to a day. In this paper we propose a novel automatic-annotation-based approach that provides textual descriptions of document images. With a one-time, offline computational effort, we are able to build a text-based retrieval system over the annotated images. This system has an interactive response time of about 0.01 seconds. The price is massive offline computation, performed on a cluster of 35 computers over about a month. Our procedure is highly automatic, requiring minimal human intervention.
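
The contrast between per-query image matching and a one-time annotation step followed by text lookup can be illustrated with a minimal sketch. It is not the authors' implementation: the `(page_id, word)` annotation format, the function names `build_index` and `search`, and the sample data are all hypothetical, and the offline annotation stage itself is assumed to have already run.

```python
from collections import defaultdict


def build_index(annotations):
    """Build an inverted index from annotation words to page identifiers.

    `annotations` is an iterable of (page_id, word) pairs. This format is an
    assumption made for illustration; it stands in for whatever the offline
    annotation stage produces.
    """
    index = defaultdict(set)
    for page_id, word in annotations:
        index[word].add(page_id)
    return index


def search(index, query):
    """Return the pages annotated with every word of the query, sorted."""
    words = query.split()
    if not words:
        return []
    # One dictionary lookup per query word; no image comparison at query time.
    postings = [index.get(word, set()) for word in words]
    return sorted(set.intersection(*postings))


# Hypothetical annotations: placeholder words, not data from the paper.
annotations = [
    ("book1/p3", "katha"),
    ("book1/p7", "kavita"),
    ("book2/p1", "katha"),
    ("book2/p1", "kavita"),
]
index = build_index(annotations)
single_word_hits = search(index, "katha")
two_word_hits = search(index, "katha kavita")
```

In this sketch the query cost depends on the number of query words and the size of their posting sets, not on the number of word images in the collection, which is the property that makes an interactive response time plausible once the annotations exist.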


Keywords: Digital Library; Document Image; Dynamic Time Warping; Optical Character Recognition; Cluster Centroid
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • K. Pramod Sankar (1)
  • C. V. Jawahar (1)
  1. Centre for Visual Information Technology, International Institute of Information Technology, Hyderabad, India
