The Impact of OCR Accuracy and Feature Transformation on Automatic Text Classification

  • Mayo Murata
  • Lazaro S. P. Busagala
  • Wataru Ohyama
  • Tetsushi Wakabayashi
  • Fumitaka Kimura
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3872)

Abstract

Digitizing printed documents involves generating text with an OCR system for applications such as full-text retrieval and document organization. However, with present OCR technology, the generated text contains errors. Moreover, previous studies have shown that classification performance decreases as OCR accuracy decreases. The reason is the use of absolute word frequency as the feature vector. Representing OCR text by absolute word frequency has limitations, such as dependency on text length and on the word recognition rate, which lead to higher within-class variance and consequently to lower classification performance. We describe feature transformation techniques that do not have these limitations and present improved experimental results for all classifiers used.
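
The dependency of absolute word frequency on text length can be illustrated with a minimal sketch. The abstract does not name the specific transformations, so the relative-frequency normalization and the power transformation below are assumptions chosen for illustration, and all function names, the vocabulary, and the toy documents are hypothetical.

```python
from collections import Counter


def absolute_frequency(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def relative_frequency(tokens, vocabulary):
    """Divide each count by the total number of in-vocabulary tokens.

    Assumed transformation: it removes the dependency on text length.
    """
    absolute = absolute_frequency(tokens, vocabulary)
    total = sum(absolute)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [count / total for count in absolute]


def power_transform(vector, exponent=0.5):
    """Raise each non-negative component to a power between 0 and 1.

    Assumed transformation: it compresses large values; the exponent
    0.5 is an arbitrary illustrative choice.
    """
    return [value ** exponent for value in vector]


# Hypothetical vocabulary and documents of the same class.
vocabulary = ["ocr", "text", "error"]
short_doc = ["ocr", "text", "ocr", "error"]
long_doc = short_doc * 3  # same content, three times as long

print(absolute_frequency(short_doc, vocabulary))  # [2, 1, 1]
print(absolute_frequency(long_doc, vocabulary))   # [6, 3, 3]
print(relative_frequency(short_doc, vocabulary))  # [0.5, 0.25, 0.25]
print(relative_frequency(long_doc, vocabulary))   # [0.5, 0.25, 0.25]
print(power_transform(relative_frequency(long_doc, vocabulary)))
```

The two documents have the same word distribution, yet their absolute-frequency vectors differ by a factor of three, which inflates within-class variance. After the assumed normalization the vectors coincide. A document whose words are recognized at a uniformly lower rate behaves much like a shorter document, so the same reasoning plausibly applies to the word recognition rate.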

Keywords

Feature Vector, Classification Rate, Optical Character Recognition, Feature Transformation, Linear Discriminant Function

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Mayo Murata (1)
  • Lazaro S. P. Busagala (1)
  • Wataru Ohyama (1)
  • Tetsushi Wakabayashi (1)
  • Fumitaka Kimura (1)

  1. Faculty of Engineering, Mie University, Mie, Japan
