Information Retrieval, Volume 3, Issue 3, pp 189-216

Information Retrieval can Cope with Many Errors

  • Elke Mittendorf
  • Peter Schäuble, Eurospider Information Technology AG


Retrieving documents that originate from digitized, OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been the subject of research for several years. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is important, however, that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results indicate that expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed appropriately and with care.

Keywords: probabilistic modelling; retrieval effectiveness; optical character recognition; data corruption
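
The idea of treating OCR conversion as a random experiment can be illustrated with a small simulation. The sketch below is not the authors' model: it assumes, purely for illustration, that each letter is misrecognized independently with a fixed probability, and it ranks documents by a plain count of matching query terms. The names `corrupt`, `score`, and `rank`, and all sample texts, are hypothetical.

```python
import random
from collections import Counter


def corrupt(text, error_rate, rng):
    """Toy OCR noise (an assumption, not the paper's model): replace each
    letter independently with probability error_rate by a random letter."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < error_rate:
            out.append(rng.choice(alphabet))
        else:
            out.append(ch)  # spaces and untouched letters pass through
    return "".join(out)


def score(query, document):
    """Sum the occurrences of each query term in the document."""
    counts = Counter(document.split())
    return sum(counts[term] for term in query.split())


def rank(query, documents):
    """Return document indices ordered by descending score (ties keep input order)."""
    return sorted(range(len(documents)),
                  key=lambda i: score(query, documents[i]),
                  reverse=True)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Long documents repeat their terms, so some clean copies tend to survive.
    relevant = "scanned pages contain ocr errors " * 20
    other = "weather report for the coming week " * 20
    noisy = [corrupt(relevant, 0.2, rng), corrupt(other, 0.2, rng)]
    print(rank("ocr errors", noisy))
```

Varying the repetition count and `error_rate` gives a rough feel for why document length matters: a short document offers few chances for a query term to survive corruption, while a long one offers many.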