Successfully detecting and correcting false friends using channel profiles

  • Ulrich Reffle
  • Annette Gotscharek
  • Christoph Ringlstetter
  • Klaus U. Schulz
Original Paper

Abstract

The detection and correction of false friends (also called real-word errors) is a notoriously difficult problem. On realistic data, the break-even point for automatic correction has so far not been reached: the number of additional infelicitous corrections outnumbered the useful corrections. We present a new approach in which we first compute a profile of the error channel for the given text. During the correction process, the profile (1) helps to restrict attention to a small set of “suspicious” lexical tokens of the input text for which it is “plausible” to assume that the token represents a false friend. In this way, recognition of false friends is improved. Furthermore, the profile (2) helps to isolate the “most promising” correction suggestion for each “suspicious” token. Using conventional word trigram statistics for disambiguation, we obtain a correction method that can be successfully applied to unrestricted text. In experiments on OCR documents, we show significant accuracy gains from fully automatic correction of false friends.
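
The two roles of the profile described above can be illustrated with a minimal sketch. It is not the authors' implementation: the profile format (pairs of intended substring and OCR output with an estimated frequency), the toy lexicon, the toy trigram counts, the `min_freq` threshold, the add-one scoring, and the names `PROFILE`, `LEXICON`, `TRIGRAMS`, `candidates`, `trigram_score` and `correct` are all assumptions made for illustration. A token is treated as suspicious only if undoing a sufficiently frequent channel error turns it into another lexicon word, and a suggestion replaces the token only if the surrounding word trigrams prefer it.

```python
from collections import Counter

# Hypothetical channel profile: (intended substring, OCR output) -> estimated frequency.
PROFILE = {("rn", "m"): 0.02, ("l", "i"): 0.01, ("e", "c"): 0.0005}
# Toy lexicon and toy word trigram counts (illustrative values only).
LEXICON = {"the", "modern", "modem", "art", "museum", "is", "open"}
TRIGRAMS = Counter({
    ("the", "modern", "art"): 12,
    ("modern", "art", "museum"): 9,
    ("the", "modem", "is"): 5,
})


def candidates(token, profile, lexicon, min_freq):
    """Lexicon words that a frequent channel error could have turned into `token`."""
    found = {}
    for (intended, observed), freq in profile.items():
        if freq < min_freq:
            continue  # rare error patterns do not make a token suspicious
        start = token.find(observed)
        while start != -1:
            cand = token[:start] + intended + token[start + len(observed):]
            if cand in lexicon and cand != token:
                found[cand] = max(found.get(cand, 0.0), freq)
            start = token.find(observed, start + 1)
    return found


def trigram_score(tokens, i, word, trigrams):
    """Product of add-one smoothed counts of the three trigrams covering position i."""
    padded = ["<s>", "<s>"] + tokens[:i] + [word] + tokens[i + 1:] + ["</s>", "</s>"]
    j = i + 2  # position of `word` in the padded list
    score = 1.0
    for k in range(j - 2, j + 1):
        score *= trigrams[tuple(padded[k:k + 3])] + 1
    return score


def correct(tokens, profile, lexicon, trigrams, min_freq=0.005):
    """Replace suspicious tokens when a candidate scores strictly higher in context."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        cands = candidates(tok, profile, lexicon, min_freq)
        if not cands:
            continue  # token is not suspicious under this profile
        best = max(cands, key=lambda c: trigram_score(tokens, i, c, trigrams))
        if trigram_score(tokens, i, best, trigrams) > trigram_score(tokens, i, tok, trigrams):
            out[i] = best
    return out


if __name__ == "__main__":
    print(correct(["the", "modem", "art", "museum", "is", "open"],
                  PROFILE, LEXICON, TRIGRAMS))
    print(correct(["the", "modem", "is", "open"], PROFILE, LEXICON, TRIGRAMS))
```

In this toy setting the first sentence has "modem" replaced by "modern", while the second keeps "modem" because the context supports it. Restricting candidates to error patterns that are frequent in the profile is what keeps the number of tokens examined small, which in turn limits the opportunities for infelicitous corrections.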

Keywords

False friends, Error correction, Error dictionaries

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  • Ulrich Reffle (1)
  • Annette Gotscharek (1)
  • Christoph Ringlstetter (1)
  • Klaus U. Schulz (1)

  1. CIS, University of Munich, Munich, Germany
