The Effects of OCR Error on the Extraction of Private Information

  • Kazem Taghva
  • Russell Beckley
  • Jeffrey Coombs
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3872)

Abstract

OCR error has been shown not to affect the average accuracy of text retrieval or text categorization. Recent studies, however, have indicated that information extraction is significantly degraded by OCR error. We experimented with information extraction software on two collections, one of OCR-ed documents and the other of manually corrected versions of the same documents. We found a significant reduction in accuracy on the OCR text relative to the corrected text. The majority of errors were attributable to zoning problems rather than OCR classification errors.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Kazem Taghva
    • 1
  • Russell Beckley
    • 1
  • Jeffrey Coombs
    • 1
  1. Information Science Research Institute, University of Nevada, Las Vegas
