Machine Learning

, Volume 34, Issue 1–3, pp 107–130 | Cite as

A Winnow-Based Approach to Context-Sensitive Spelling Correction

  • Andrew R. Golding
  • Dan Roth


A large class of machine-learning problems in natural language require the characterization of linguistic context. Two characteristic properties of such problems are that their feature space is of very high dimensionality, and their target concepts depend on only a small subset of the features in the space. Under such conditions, multiplicative weight-update algorithms such as Winnow have been shown to have exceptionally good theoretical properties. In the work reported here, we present an algorithm combining variants of Winnow and weighted-majority voting, and apply it to a problem in the aforementioned class: context-sensitive spelling correction. This is the task of fixing spelling errors that happen to result in valid words, such as substituting to for too, casual for causal, and so on. We evaluate our algorithm, WinSpell, by comparing it against BaySpell, a statistics-based method representing the state of the art for this task. We find: (1) When run with a full (unpruned) set of features, WinSpell achieves accuracies significantly higher than BaySpell was able to achieve in either the pruned or unpruned condition; (2) When compared with other systems in the literature, WinSpell exhibits the highest performance; (3) While several aspects of WinSpell's architecture contribute to its superiority over BaySpell, the primary factor is that it is able to learn a better linear separator than BaySpell learns; (4) When run on a test set drawn from a different corpus than the training set was drawn from, WinSpell is better able than BaySpell to adapt, using a strategy we will present that combines supervised learning on the training set with unsupervised learning on the (noisy) test set.

Winnow multiplicative weight-update algorithms context-sensitive spelling correction Bayesian classifiers 


  1. Blum, A. (1992). Learning boolean functions in an infinite attribute space. Machine Learning, 9(4), 373–386.Google Scholar
  2. Breiman, L. (1994). Bagging predictors (Technical Report 421). University of California, Berkeley.Google Scholar
  3. Cesa-Bianchi, N., Freund, Y., Helmbold, D.P., & Warmuth, M.(1994). On-line prediction and conversion strategies. Computational Learning Theory: Eurocolt'93 (pp. 205–216). Oxford University Press.Google Scholar
  4. Chen, S.F., & Goodman, J. (1996). An empirical study of smoothing techniques for language modeling. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, CA.Google Scholar
  5. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.Google Scholar
  6. Dagan, I., Karov, Y., & Roth, D. (1997). Mistake-driven learning in text categorization. EMNLP-97, The Second Conference on Empirical Methods in Natural Language Processing (pp. 55–63).Google Scholar
  7. Dietterich, T.G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1924.Google Scholar
  8. Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103–130.Google Scholar
  9. Duda, R.O., & Hart, P.E. (1973). Pattern classification and scene analysis. Wiley.Google Scholar
  10. Fleiss, J.L. (1981). Statistical methods for rates and proportions. John Wiley and Sons.Google Scholar
  11. Flexner, S.B. (Ed.). (1983). Random House unabridged dictionary (2nd ed.). New York: Random House.Google Scholar
  12. Freund, Y., & Schapire, R.E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. Computational Learning Theory: Eurocolt'95 (pp. 23–37). Springer-Verlag.Google Scholar
  13. Gale, W.A., Church, K.W., & Yarowsky, D. (1993). A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26, 415–439.Google Scholar
  14. Golding, A.R. (1995). A Bayesian hybrid method for context-sensitive spelling correction. Proceedings of the 3rd Workshop on Very Large Corpora, Boston, MA.Google Scholar
  15. Golding, A.R., & Schabes, Y. (1996). Combining trigram-based and feature-based methods for context-sensitive spelling correction. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, CA.Google Scholar
  16. Herbster, M., & Warmuth, M. (1995). Tracking the best expert. Proceedings of the 12th International Conference on Machine Learning (pp. 286–294). Morgan Kaufmann.Google Scholar
  17. Holte, R.C., Acker, L.E., & Porter, B.W. (1989). Concept learning and the problem of small disjuncts. Proceedings of the International Joint Conference on Artificial Intelligence, Detroit.Google Scholar
  18. Jones, M.P., & Martin, J.H. (1997). Contextual spelling correction using latent semantic analysis. Proceedings of the 5th Conference on Applied Natural Language Processing, Washington, DC.Google Scholar
  19. Katz, S.M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-35(3), 400–401.Google Scholar
  20. Kivinen, J., & Warmuth, M.K. (1995). Exponentiated gradient versus gradient descent for linear predictors. ACM Symp. on the Theory of Computing.Google Scholar
  21. Kneser, R., & Ney, H. (1995). Improved backing-off for m-gram language modeling. Proceedings of the International Conf. on Acoustics, Speech, and Signal Processing (Vol. 1, pp. 181–184).Google Scholar
  22. Kohavi, R., Becker, B., & Sommerfield, D. (1997). Improving simple Bayes. Proceedings of the European Conference on Machine Learning.Google Scholar
  23. KuLcera, H., & Francis, W.N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.Google Scholar
  24. Kukich, K. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4), 377–439.Google Scholar
  25. Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285–318.Google Scholar
  26. Littlestone, N. (1991). Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow. Proceedings of the 4th AnnualWorkshop on Computational Learning Theory (pp. 147–156). Morgan Kaufmann.Google Scholar
  27. Littlestone, N. (1995). Comparing several linear-threshold learning algorithms on tasks involving superfluous attributes. Proceedings of the 12th International Conference on Machine Learning (pp. 353–361). Morgan Kaufmann.Google Scholar
  28. Littlestone, N., & Warmuth, M.K. (1994). The weighted majority algorithm. Information and Computation, 108(2), 212–261.Google Scholar
  29. Mangu, L., & Brill, E. (1997). Automatic rule acquisition for spelling correction. Proceedings of the 14th International Conference on Machine Learning. Morgan Kaufmann.Google Scholar
  30. Marcus, M.P., Santorini, B., & Marcinkiewicz, M. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.Google Scholar
  31. Mays, E., Damerau, F.J., & Mercer, R.L. (1991). Context based spelling correction. Information Processing and Management, 27(5), 517–522.Google Scholar
  32. Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modelling. Computer Speech and Language, 8, 1–38.Google Scholar
  33. Ng, H.T., & Lee, H.B. (1996). Integrating multiple knowledge sources to disambiguate word sense: An examplarbased approach. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, CA.Google Scholar
  34. Powers, D. (1997). Learning and application of differential grammars. Proceedings of the ACL Special Interest Group in Natural Language Learning, Madrid.Google Scholar
  35. Reddy, L., & Tadepalli, P. (1997). Active learning with committees for text categorization. Proceedings of the National Conference on Artificial Intelligence (pp. 602–608).Google Scholar
  36. Roth, D. (1998). Learning to resolve natural language ambiguities: Aunified approach. Proceedings of the National Conference on Artificial Intelligence (pp. 806–813).Google Scholar
  37. Roth, D., & Zelenko, D. (1998). Part of speech tagging using a network of linear separators. COLING-ACL 98, The 17th International Conference on Computational Linguistics (pp. 1136–1142).Google Scholar
  38. Valiant, L.G. (1994). Circuits of the mind. Oxford University Press.Google Scholar
  39. Valiant, L.G. (1995). Rationality. Workshop on Computational Learning Theory (pp. 3–14).Google Scholar
  40. Yarowsky, D. (1994). Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM.Google Scholar

Copyright information

© Kluwer Academic Publishers 1999

Authors and Affiliations

  • Andrew R. Golding
    • 1
  • Dan Roth
    • 2
  1. 1.MERL—A Mitsubishi Electric Research LaboratoryCambridge
  2. 2.Department of Computer ScienceUniversity of Illinois—Urbana/ChampaignUrbana

Personalised recommendations