Information Retrieval, Volume 2, Issue 2–3, pp 141–163

Information Retrieval from Documents: A Survey

  • M. Mitra
  • B.B. Chaudhuri

DOI: 10.1023/A:1009950525500

Cite this article as:
Mitra, M. & Chaudhuri, B. Information Retrieval (2000) 2: 141. doi:10.1023/A:1009950525500


Given the phenomenal growth in the variety and quantity of data available to users through electronic media, there is a great demand for efficient and effective ways to organize and search through all this information. Besides speech, our principal means of communication is through visual media, and in particular, through documents. In this paper, we provide an update on Doermann's comprehensive survey (1998) of research results in the broad area of document-based information retrieval. The scope of this survey is also somewhat broader, and there is a greater emphasis on relating document image analysis methods to conventional IR methods.

Documents are available in a wide variety of formats. Technical papers are often available as ASCII files of clean, correct text. Other documents may only be available as hardcopies. These documents have to be scanned and stored as images so that they may be processed by a computer. The textual content of these documents may also be extracted and recognized using OCR methods. Our survey covers the broad spectrum of methods that are required to handle different formats such as text and images. The core of the paper focuses on methods that manipulate document images directly and perform various information processing tasks, such as retrieval, categorization, and summarization, without attempting to completely recognize the textual content of the document. We start, however, with a brief overview of traditional IR techniques that operate on clean text. We also discuss research dealing with text that is generated by running OCR on document images. Finally, we briefly touch on the related problem of content-based image retrieval.

Keywords: text retrieval; optical character recognition; document image analysis; image retrieval

Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • M. Mitra (1)
  • B.B. Chaudhuri (1)

  1. Indian Statistical Institute, Calcutta
