Optical character recognition errors and their effects on natural language processing

  • Daniel Lopresti
Original Paper

Abstract

Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. This dataset has also been made available online to encourage future investigations.
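
As a rough illustration of how dynamic programming can cast error classification as an optimization problem, the sketch below aligns a ground-truth string with its OCR output by minimum edit distance and labels each difference as a substitution, deletion, or insertion. It is a single-level, character-only simplification written for this summary, not the hierarchical procedure used in the paper; the function name `align`, the unit edit costs, and the example strings are all assumptions.

```python
from typing import List, Tuple

# (operation, ground-truth character, OCR character); names are hypothetical
Edit = Tuple[str, str, str]


def align(truth: str, ocr: str) -> Tuple[int, List[Edit]]:
    """Return the minimum edit distance and one optimal list of edits.

    Assumes unit cost for every substitution, deletion, and insertion.
    """
    n, m = len(truth), len(ocr)
    # cost[i][j] = minimum number of edits turning truth[:i] into ocr[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the final cell to recover one optimal alignment.
    edits: List[Edit] = []
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and truth[i - 1] != ocr[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            if mismatch:
                edits.append(("substitution", truth[i - 1], ocr[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            # Character present in the ground truth but missing from the OCR
            edits.append(("deletion", truth[i - 1], ""))
            i -= 1
        else:
            # Spurious character present only in the OCR output
            edits.append(("insertion", "", ocr[j - 1]))
            j -= 1
    edits.reverse()
    return cost[n][m], edits


if __name__ == "__main__":
    print(align("cat", "cot"))  # (1, [('substitution', 'a', 'o')])
    print(align("a.b", "ab"))   # (1, [('deletion', '.', '')])
```

The second call shows the kind of error that matters downstream: a dropped period is a single-character deletion at this level, yet it removes a cue that a sentence boundary detector would normally rely on.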

Keywords

Performance evaluation · Optical character recognition · Sentence boundary detection · Tokenization · Part-of-speech tagging

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  1. Department of Computer Science and Engineering, Lehigh University, Bethlehem, USA
