Probabilistic retrieval of OCR degraded text using N-grams

  • S. M. Harding
  • W. B. Croft
  • C. Weir
Information Retrieval II
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1324)


This paper examines the retrieval of OCR-degraded text using n-gram formulations within a probabilistic retrieval system. Direct retrieval of documents using n-gram databases of 2- and 3-grams, or of 2-, 3-, 4- and 5-grams, improved retrieval performance over standard (word-based) queries on the same data when degradation reached 10 percent or worse. A second method, which uses n-grams to identify matching and near-matching terms for query expansion, is also described; it too performed better than standard queries. This method was less effective than direct n-gram query formulations but can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, a web-based retrieval application is described that uses n-gram retrieval of OCR text and displays the source document image with query terms highlighted.
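As a rough illustration of the n-gram idea, the sketch below splits terms into character 2- and 3-grams and ranks vocabulary terms by n-gram overlap with a query term, which is one way near-matching (for example, OCR-corrupted) variants could be found for query expansion. It is not the authors' implementation: the function names, the Dice coefficient as the similarity measure, and the 0.5 threshold are all assumptions made for this example.

```python
def char_ngrams(term, sizes=(2, 3)):
    """Return the set of character n-grams of the given sizes in `term`."""
    term = term.lower()
    return {term[i:i + n] for n in sizes for i in range(len(term) - n + 1)}


def dice(a, b, sizes=(2, 3)):
    """Dice coefficient over n-gram sets (an assumed similarity measure)."""
    ga, gb = char_ngrams(a, sizes), char_ngrams(b, sizes)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def near_matches(query_term, vocabulary, threshold=0.5):
    """Return (score, term) pairs at or above the threshold, best first."""
    scored = [(dice(query_term, v), v) for v in vocabulary]
    return sorted(((s, v) for s, v in scored if s >= threshold), reverse=True)


# Hypothetical vocabulary containing an OCR-style corruption ("l" read as "1").
candidates = near_matches("retrieval", ["retrieva1", "retrieval", "database"])
```

Here the corrupted form "retrieva1" still shares most of its 2- and 3-grams with "retrieval", so it survives the threshold, while an unrelated term such as "database" shares none and is dropped. A word-based match would miss the corrupted form entirely.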


Average Precision, Document Image, Retrieval Performance, Query Term, Query Expansion
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 1997

Authors and Affiliations

  • S. M. Harding (1)
  • W. B. Croft (1)
  • C. Weir (2)

  1. CIIR, University of Massachusetts, Amherst, USA
  2. Lockheed Martin C2 Systems, Frazer, USA
