Machine Learning, Volume 34, Issue 1–3, pp 11–41

Forgetting Exceptions is Harmful in Language Learning

  • Walter Daelemans
  • Antal van den Bosch
  • Jakub Zavrel


We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first series of experiments we combine memory-based learning with training set editing techniques, in which instances are edited based on their typicality and class prediction strength. Results show that editing exceptional instances (with low typicality or low class prediction strength) tends to harm generalization accuracy. In a second series of experiments we compare memory-based learning and decision-tree learning methods on the same selection of tasks, and find that decision-tree learning often performs worse than memory-based learning. Moreover, the decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness). We provide explanations for both results in terms of the properties of the natural language processing tasks and the learning algorithms.
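The edited nearest-neighbor setup the abstract describes can be sketched roughly as follows. The overlap metric is the standard distance for symbolic features in memory-based learning; the toy data and the disagreement-based editing criterion are illustrative stand-ins for the paper's typicality and class-prediction-strength measures, not its exact procedure.

```python
def overlap_distance(a, b):
    """Overlap metric: count mismatching symbolic feature values."""
    return sum(x != y for x, y in zip(a, b))

def classify(instance, memory):
    """1-NN classification over a memory of (features, label) pairs."""
    return min(memory, key=lambda m: overlap_distance(instance, m[0]))[1]

def edit_exceptions(memory):
    """Remove instances misclassified by the rest of the memory --
    a crude leave-one-out stand-in for low class prediction strength."""
    kept = []
    for i, (feats, label) in enumerate(memory):
        others = memory[:i] + memory[i + 1:]
        if classify(feats, others) == label:
            kept.append((feats, label))
    return kept

# Tiny symbolic task: predict a label from three feature values.
memory = [
    (("a", "b", "c"), "X"),
    (("a", "b", "d"), "X"),
    (("a", "e", "c"), "X"),
    (("f", "g", "h"), "Y"),
    (("f", "g", "i"), "Y"),
    (("a", "b", "h"), "Y"),  # an "exception" inside the X region
]

edited = edit_exceptions(memory)
print(len(memory), len(edited))  # the exceptional instance is edited out
```

The paper's point is that discarding instances like the edited one above can hurt generalization, because in language data such "exceptions" often form small but productive pockets of regularity.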

Keywords: memory-based learning, natural language learning, edited nearest neighbor classifier, decision-tree learning



Copyright information

© Kluwer Academic Publishers 1999

Authors and Affiliations

  • Walter Daelemans¹
  • Antal van den Bosch¹
  • Jakub Zavrel¹

  1. ILK / Computational Linguistics, Tilburg University, Tilburg, The Netherlands
