Character confusion versus focus word-based correction of spelling and OCR variants in corpora

Open Access
Original Paper

Abstract

We present a new approach, based on anagram hashing, for handling lexical variation globally in large and noisy text collections. The lexical variation addressed by spelling correction systems is primarily typographical variation. This is typically handled in a local fashion: given one particular text string, some system of retrieving near-neighbors is applied, where near-neighbors are other text strings that differ from the particular string by a given number of characters. The difference in characters between the original string and one of its retrieved near-neighbors constitutes a particular character confusion. We present a global way of performing this action: for all possible character confusions within a given edit distance, we sequentially identify all the pairs of text strings in the text collection that display that particular confusion. We work on large digitized corpora, which contain lexical variation due to both the OCR process and typographical or typesetting errors, and show that all these types of variation can be handled equally well in the framework we present. The character confusion-based prototype of Text-Induced Corpus Clean-up (TICCL) is compared to its focus word-based counterpart and evaluated on 6 years’ worth of digitized Dutch Parliamentary documents. On large corpora, the character confusion approach is shown to be an order of magnitude faster than its word-based counterpart. Insights gained about the useful contribution of global corpus variation statistics are shown to also benefit the more traditional word-based approach to spelling correction. Final tests on a held-out set comprising the 1918 edition of the Dutch daily newspaper ‘Het Volk’ show that the system is not sensitive to domain variation.
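The global, character confusion-based search described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the hash function, so the key used here (the sum of the fifth powers of the code points of a word's characters) is an assumption, as are all function names, the toy vocabulary, and the final edit-distance check. This is an illustration of the general idea, not the TICCL implementation.

```python
from collections import defaultdict


def anagram_key(word, power=5):
    """Order-independent numeric key (assumed form): anagrams share one key."""
    return sum(ord(ch) ** power for ch in word)


def build_index(vocabulary):
    """Map each anagram key to the set of corpus words that have it."""
    index = defaultdict(set)
    for word in vocabulary:
        index[anagram_key(word)].add(word)
    return dict(index)


def confusion_values(alphabet):
    """Key differences for all edit-distance-1 confusions over the alphabet."""
    singles = [anagram_key(ch) for ch in alphabet]
    values = set(singles)  # insertion or deletion of one character
    for a in singles:
        for b in singles:
            if a > b:
                values.add(a - b)  # substitution of one character by another
    return values


def levenshtein(a, b):
    """Plain dynamic-programming edit distance, used to filter candidates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def variant_pairs(vocabulary, alphabet):
    """For every confusion value, collect all word pairs displaying it."""
    index = build_index(vocabulary)
    found = set()
    for delta in confusion_values(alphabet):
        for key, words in index.items():
            # Words whose key is exactly `delta` higher are candidates.
            for other in index.get(key + delta, ()):
                for word in words:
                    # Matching key differences can be coincidental,
                    # so confirm the pair is really one edit apart.
                    if levenshtein(word, other) == 1:
                        found.add((word, other))
    return sorted(found)


vocabulary = ["regering", "regeriug", "minister", "kamer", "kamers"]
pairs = variant_pairs(vocabulary, "abcdefghijklmnopqrstuvwxyz")
print(pairs)
```

In this sketch the outer loop runs over confusions rather than over focus words, so each confusion is looked up once against the whole index. Real OCR confusions would need a larger alphabet and higher edit distances than the single-character case shown here.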

Copyright information

© The Author(s) 2010

Authors and Affiliations

  1. Tilburg Centre for Cognition and Communication, Tilburg University, Tilburg, Netherlands
