Non-interactive OCR Post-correction for Giga-Scale Digitization Projects

  • Martin Reynaert
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4919)

Abstract

This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, contemporary and historical. Text-Induced Corpus Clean-up, or TICCL (pronounced 'tickle'), focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants of any particular focus word that lie within a predefined Levenshtein distance (henceforth: LD). Simple text-induced filtering techniques help to retain as many of the true positives as possible and to discard as many of the false positives as possible. TICCL has been evaluated on a contemporary OCR-ed Dutch text corpus and on a corpus of historical newspaper articles, whose OCR quality is far lower and which is written in an older Dutch spelling. Representative samples of typographical variants from both corpora have allowed us not only to evaluate our system properly, but also to draw conclusions about how to adapt the adopted correction mechanism to OCR-error resolution. The performance scores obtained up to LD 2 mean that the bulk of the undesirable OCR-induced typographical variation present can be removed fully automatically.
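The core idea in the abstract, gathering the variants of a high-frequency focus word that lie within a given Levenshtein distance, can be illustrated with a minimal Python sketch. This is not the system's actual implementation: the function names, the `min_focus_freq` threshold, and the "variant must be rarer than the focus word" filter are assumptions made for illustration, and the brute-force pairwise scan shown here would not scale to a giga-scale corpus.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def gather_variants(tokens, min_focus_freq=3, max_ld=2):
    """Map each frequent focus word to rarer word types within max_ld.

    Hypothetical helper: the frequency threshold and the rarity filter
    are illustrative stand-ins, not the filters used in the paper.
    """
    freq = Counter(tokens)
    focus_words = [w for w, n in freq.items() if n >= min_focus_freq]
    variants = {}
    for focus in focus_words:
        variants[focus] = sorted(
            w for w, n in freq.items()
            if w != focus
            and n < freq[focus]                      # assumed rarity filter
            and levenshtein(focus, w) <= max_ld
        )
    return variants


if __name__ == "__main__":
    # Toy corpus: one frequent Dutch word plus two made-up OCR-style variants.
    corpus = ["regering"] * 5 + ["regcring", "rcgcring", "koning"]
    print(gather_variants(corpus))
```

In this toy corpus the frequent word acts as the focus word, the two garbled forms fall within LD 2 of it, and the unrelated word does not. A real system would additionally need the filtering step described above to keep legitimate near-neighbour words from being treated as variants.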

Keywords

Word Type, Levenshtein Distance, Text Collection, Word String, Vocabulary Growth
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Martin Reynaert
    Induction of Linguistic Knowledge, Tilburg University, The Netherlands
