Information Retrieval, Volume 3, Issue 3, pp 189–216

Information Retrieval can Cope with Many Errors

Authors

  • Elke Mittendorf
  • Peter Schäuble
    • Eurospider Information Technology AG

DOI: 10.1023/A:1026564708926

Cite this article as:
Mittendorf, E. & Schäuble, P. Information Retrieval (2000) 3: 189. doi:10.1023/A:1026564708926

Abstract

The retrieval of documents that originate from digitized and OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been the subject of research for several years. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is, however, important that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results indicate that expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed in an appropriate way and with care.
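The idea of treating OCR conversion as a random experiment can be illustrated with a toy sketch. It is not the authors' model: it assumes, purely for illustration, that each occurrence of a query term is misrecognized independently with the same probability, and that a misrecognized occurrence can no longer be matched. The names `survival_probability`, `simulate`, and `error_rate` are hypothetical. Under these assumptions, a term that occurs several times in a document is very likely to survive in at least one place, which hints at why documents that are not too short may tolerate many errors.

```python
import random


def survival_probability(occurrences: int, error_rate: float) -> float:
    """Probability that at least one occurrence of a term survives OCR.

    Assumption (not from the paper): each occurrence is corrupted
    independently with probability `error_rate`.
    """
    return 1.0 - error_rate ** occurrences


def simulate(occurrences: int, error_rate: float,
             trials: int = 20000, seed: int = 0) -> float:
    """Estimate the same probability by repeating the random experiment."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # An occurrence survives when its random draw is not an error.
        if any(rng.random() >= error_rate for _ in range(occurrences)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        exact = survival_probability(n, 0.3)
        estimate = simulate(n, 0.3)
        print(f"occurrences={n:2d} exact={exact:.4f} simulated={estimate:.4f}")
```

Even under this simplified model, the independence assumption matters: if errors cluster on particular words or documents, the term can be lost everywhere at once, which is consistent with the abstract's caveat about how errors are distributed.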

Keywords

probabilistic modelling, retrieval effectiveness, optical character recognition, data corruption

Copyright information

© Kluwer Academic Publishers 2000