Information Retrieval, Volume 3, Issue 3, pp 189–216

Information Retrieval can Cope with Many Errors

  • Elke Mittendorf
  • Peter Schäuble

Abstract

The retrieval of documents that originate from digitized and OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been subject to research for several years now. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is, however, important that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results disclose that an expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed in an appropriate way and with care.

Keywords: probabilistic modelling, retrieval effectiveness, optical character recognition, data corruption
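
The abstract's point about document length can be illustrated with a toy calculation. This is a sketch under simplifying assumptions, not the authors' model: each occurrence of a term is assumed to be misrecognized independently with the same probability, and longer documents are assumed to repeat their important terms more often. The function names and the parameters `error_rate` and `occurrences` are hypothetical, chosen only for this illustration.

```python
import random


def survival_probability(error_rate: float, occurrences: int) -> float:
    """Probability that at least one of `occurrences` copies of a term is
    recognized correctly, assuming each copy is corrupted independently
    with probability `error_rate` (an assumption of this sketch)."""
    return 1.0 - error_rate ** occurrences


def simulate_survival(error_rate: float, occurrences: int,
                      trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity: treat OCR conversion of
    each term occurrence as an independent random experiment."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # An occurrence is read correctly when the draw is >= error_rate.
        if any(rng.random() >= error_rate for _ in range(occurrences)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (1, 3, 10):
        exact = survival_probability(0.3, n)
        estimate = simulate_survival(0.3, n)
        print(f"occurrences={n:2d}  exact={exact:.4f}  simulated={estimate:.4f}")
```

Under these assumptions, a term that appears only once is lost whenever that single occurrence is misrecognized, whereas a term repeated several times is very likely to survive in at least one correct form. This is consistent with the abstract's remark that documents should not be too short.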

Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • Elke Mittendorf (1)
  • Peter Schäuble (2)
  1. Zürich, Switzerland
  2. Eurospider Information Technology AG, Zürich, Switzerland
