Data Mining and Knowledge Discovery, Volume 3, Issue 4, pp. 409–425

The Role of Occam's Razor in Knowledge Discovery

  • Pedro Domingos

Abstract

Many KDD systems incorporate an implicit or explicit preference for simpler models, but this use of “Occam's razor” has been strongly criticized by several authors (e.g., Schaffer, 1993; Webb, 1996). This controversy arises partly because Occam's razor has been interpreted in two quite different ways. The first interpretation (simplicity is a goal in itself) is essentially correct, but is at heart a preference for more comprehensible models. The second interpretation (simplicity leads to greater accuracy) is much more problematic. A critical review of the theoretical arguments for and against it shows that it is unfounded as a universal principle, and demonstrably false. A review of empirical evidence shows that it also fails as a practical heuristic. This article argues that its continued use in KDD risks causing significant opportunities to be missed, and should therefore be restricted to the comparatively few applications where it is appropriate. The article proposes and reviews the use of domain constraints as an alternative for avoiding overfitting, and examines possible methods for handling the accuracy–comprehensibility trade-off.

Keywords: model selection, overfitting, multiple comparisons, comprehensible models, domain knowledge


References

  1. Abu-Mostafa, Y.S. 1989. Learning from hints in neural networks. Journal of Complexity, 6:192-198.
  2. Akaike, H. 1978. A Bayesian analysis of the minimum AIC procedure. Annals of the Institute of Statistical Mathematics, 30A:9-14.
  3. Andrews, R. and Diederich, J. (Eds.). 1996. Proceedings of the NIPS-96 Workshop on Rule Extraction from Trained Artificial Neural Networks, Snowmass, CO: NIPS Foundation.
  4. Bernardo, J.M. and Smith, A.F.M. 1994. Bayesian Theory. New York, NY: Wiley.
  5. Bishop, C.M. 1995. Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press.
  6. Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M.K. 1987. Occam's razor. Information Processing Letters, 24:377-380.
  7. Breiman, L. 1996. Bagging predictors. Machine Learning, 24:123-140.
  8. Breiman, L. and Shang, N. 1997. Born again trees. Technical Report, Berkeley, CA: Statistics Department, University of California at Berkeley.
  9. Brunk, C., Kelly, J., and Kohavi, R. 1997. MineSet: An integrated system for data mining. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 135-138.
  10. Cestnik, B. and Bratko, I. 1988. Learning redundant rules in noisy domains. Proceedings of the Eighth European Conference on Artificial Intelligence, Munich, Germany: Pitman, pp. 348-356.
  11. Cheeseman, P. 1990. On finding the most probable model. In Computational Models of Scientific Discovery and Theory Formation, J. Shrager and P. Langley (Eds.). San Mateo, CA: Morgan Kaufmann, pp. 73-95.
  12. Chickering, D.M. and Heckerman, D. 1997. Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29:181-212.
  13. Clark, P. and Matwin, S. 1993. Using qualitative models to guide inductive learning. Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA: Morgan Kaufmann, pp. 49-56.
  14. Clearwater, S. and Provost, F. 1990. RL4: A tool for knowledge-based induction. Proceedings of the Second IEEE International Conference on Tools for Artificial Intelligence, San Jose, CA: IEEE Computer Society Press, pp. 24-30.
  15. Cohen, W.W. 1994. Grammatically biased learning: Learning logic programs using an explicit antecedent description language. Artificial Intelligence, 68:303-366.
  16. Cohen, W.W. 1995. Fast effective rule induction. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 115-123.
  17. Cooper, G.F. 1997. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 1:203-224.
  18. Cover, T.M. and Thomas, J.A. 1991. Elements of Information Theory. New York, NY: Wiley.
  19. Craven, M.W. 1996. Extracting comprehensible models from trained neural networks. Unpublished doctoral dissertation, Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI.
  20. Datta, P. and Kibler, D. 1995. Learning prototypical concept descriptions. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 158-166.
  21. Djoko, S., Cook, D.J., and Holder, L.B. 1995. Analyzing the benefits of domain knowledge in substructure discovery. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montréal, Canada: AAAI Press, pp. 75-80.
  22. Domingos, P. 1996a. Two-way induction. International Journal on Artificial Intelligence Tools, 5:113-125.
  23. Domingos, P. 1996b. Unifying instance-based and rule-based induction. Machine Learning, 24:141-168.
  24. Domingos, P. 1997a. Knowledge acquisition from examples via multiple models. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 98-106.
  25. Domingos, P. 1997b. Why does bagging work? A Bayesian account and its implications. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 155-158.
  26. Domingos, P. 1998a. A process-oriented heuristic for model selection. Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI: Morgan Kaufmann, pp. 127-135.
  27. Domingos, P. 1998b. When (and how) to combine predictive and causal learning. Proceedings of the NIPS-98 Workshop on Integrating Supervised and Unsupervised Learning, Breckenridge, CO: NIPS Foundation.
  28. Domingos, P. 1999. Process-oriented estimation of generalization error. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden: Morgan Kaufmann.
  29. Domingos, P. and Pazzani, M. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103-130.
  30. Donoho, S. and Rendell, L. 1996. Constructive induction using fragmentary knowledge. Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy: Morgan Kaufmann, pp. 113-121.
  31. Drucker, H., Cortes, C., Jackel, L.D., LeCun, Y., and Vapnik, V. 1994. Boosting and other machine learning algorithms. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 53-61.
  32. Edgington, E.S. 1980. Randomization Tests. New York, NY: Marcel Dekker.
  33. Elomaa, T. 1994. In defense of C4.5: Notes on learning one-level decision trees. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 62-69.
  34. Fisher, D.H. and Schlimmer, J.C. 1988. Concept simplification and prediction accuracy. Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, MI: Morgan Kaufmann, pp. 22-28.
  35. Freund, Y. and Schapire, R.E. 1996. Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy: Morgan Kaufmann, pp. 148-156.
  36. Friedman, J.H. 1997. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1:55-77.
  37. Gams, M. 1989. New measurements highlight the importance of redundant knowledge. Proceedings of the Fourth European Working Session on Learning, Montpellier, France: Pitman, pp. 71-79.
  38. Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58.
  39. Grove, A.J. and Schuurmans, D. 1998. Boosting in the limit: Maximizing the margin of learned ensembles. Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, WI: AAAI Press, pp. 692-699.
  40. Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W., Koperski, K., Li, D., Lu, Y., Rajan, A., Stefanovic, N., Xia, B., and Zaiane, O. 1996. DBMiner: A system for mining knowledge in large relational databases. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 250-255.
  41. Hasling, D.W., Clancey, W.J., and Rennels, G. 1984. Strategic explanations for a diagnostic consultation system. In Developments in Expert Systems, M.J. Coombs (Ed.). London, UK: Academic Press, pp. 117-133.
  42. Haussler, D. 1988. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177-221.
  43. Heckerman, D., Geiger, D., and Chickering, D.M. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.
  44. Holte, R.C. 1993. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11:63-91.
  45. Imielinski, T., Virmani, A., and Abdulghani, A. 1996. DataMine: Application programming interface and query language for database mining. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 256-262.
  46. Jensen, D. 1992. Induction with randomization testing: Decision-oriented analysis of large data sets. Unpublished doctoral dissertation, Washington University, Saint Louis, MO.
  47. Jensen, D. and Cohen, P.R. 1999. Multiple comparisons in induction algorithms. Machine Learning, to appear.
  48. Jensen, D. and Schmill, M. 1997. Adjusting for multiple comparisons in decision tree pruning. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 195-198.
  49. Joachims, T. 1998. Text categorization with support vector machines: Learning with many relevant features. Proceedings of the Tenth European Conference on Machine Learning, Chemnitz, Germany: Springer-Verlag.
  50. Kamber, M., Han, J., and Chiang, J.Y. 1997. Metarule-guided mining of multi-dimensional association rules using data cubes. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 207-210.
  51. Kohavi, R. and Kunz, C. 1997. Option decision trees with majority votes. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 161-169.
  52. Kohavi, R. and Sommerfield, D. 1998. Targeting business users with decision table classifiers. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 249-253.
  53. Kong, E.B. and Dietterich, T.G. 1995. Error-correcting output coding corrects bias and variance. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 313-321.
  54. Kononenko, I. 1990. Comparison of inductive and naive Bayesian learning approaches to automatic knowledge acquisition. In Current Trends in Knowledge Acquisition, B. Wielinga (Ed.). Amsterdam, The Netherlands: IOS Press.
  55. Langley, P. 1996. Induction of condensed determinations. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 327-330.
  56. Lawrence, S., Giles, C.L., and Tsoi, A.C. 1997. Lessons in neural network training: Overfitting may be harder than expected. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press, pp. 540-545.
  57. Lee, Y., Buchanan, B.G., and Aronis, J.M. 1998. Knowledge-based learning in exploratory science: Learning rules to predict rodent carcinogenicity. Machine Learning, 30:217-240.
  58. Liu, B., Hsu, W., and Chen, S. 1997. Using general impressions to analyze discovered classification rules. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 31-36.
  59. MacKay, D. 1992. Bayesian interpolation. Neural Computation, 4:415-447.
  60. Maclin, R. and Opitz, D. 1997. An empirical evaluation of bagging and boosting. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press.
  61. Maclin, R. and Shavlik, J. 1996. Creating advice-taking reinforcement learners. Machine Learning, 22:251-281.
  62. Meo, R., Psaila, G., and Ceri, S. 1996. A new SQL-like operator for mining association rules. Proceedings of the Twenty-Second International Conference on Very Large Databases, Bombay, India: Morgan Kaufmann, pp. 122-133.
  63. Miller, Jr., R.G. 1981. Simultaneous Statistical Inference, 2nd ed. New York, NY: Springer-Verlag.
  64. Mingers, J. 1989. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4:227-243.
  65. Mitchell, T.M. 1980. The need for biases in learning generalizations. Technical report, New Brunswick, NJ: Computer Science Department, Rutgers University.
  66. Murphy, P. and Pazzani, M. 1994. Exploring the decision forest: An empirical investigation of Occam's razor in decision tree induction. Journal of Artificial Intelligence Research, 1:257-275.
  67. Murthy, S. and Salzberg, S. 1995. Lookahead and pathology in decision tree induction. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montréal, Canada: Morgan Kaufmann, pp. 1025-1031.
  68. Nédellec, C., Rouveirol, C., Adé, H., Bergadano, F., and Tausend, B. 1996. Declarative bias in ILP. In Advances in Inductive Logic Programming, L. de Raedt (Ed.). Amsterdam, The Netherlands: IOS Press, pp. 82-103.
  69. Oates, T. and Jensen, D. 1998. Large datasets lead to overly complex models: An explanation and a solution. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 294-298.
  70. Ourston, D. and Mooney, R.J. 1994. Theory refinement combining analytical and empirical methods. Artificial Intelligence, 66:273-309.
  71. Padmanabhan, B. and Tuzhilin, A. 1998. A belief-driven method for discovering unexpected patterns. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 94-100.
  72. Pazzani, M., Mani, S., and Shankle, W.R. 1997. Beyond concise and colorful: Learning intelligible rules. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 235-238.
  73. Pazzani, M.J. 1991. Influence of prior knowledge on concept acquisition: Experimental and computational results. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17:416-432.
  74. Pearl, J. 1978. On the connection between the complexity and credibility of inferred models. International Journal of General Systems, 4:255-264.
  75. Piatetsky-Shapiro, G. 1996. Editorial comments. KDD Nuggets, 96:28.
  76. Provost, F. and Jensen, D. 1998. KDD-98 Tutorial on Evaluating Knowledge Discovery and Data Mining. New York, NY: AAAI Press.
  77. Quinlan, J.R. 1996. Bagging, boosting, and C4.5. Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR: AAAI Press, pp. 725-730.
  78. Quinlan, J.R. and Cameron-Jones, R.M. 1995. Oversearching and layered search in empirical learning. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montréal, Canada: Morgan Kaufmann, pp. 1019-1024.
  79. Quinlan, J.R. and Rivest, R.L. 1989. Inferring decision trees using the minimum description length principle. Information and Computation, 80:227-248.
  80. Rao, J.S. and Potts, W.J.E. 1997. Visualizing bagged decision trees. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 243-246.
  81. Rao, R.B., Gordon, D., and Spears, W. 1995. For every action, is there really an equal and opposite reaction? Analysis of the conservation law for generalization performance. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 471-479.
  82. Rissanen, J. 1978. Modeling by shortest data description. Automatica, 14:465-471.
  83. Russell, S.J. 1986. Preliminary steps towards the automation of induction. Proceedings of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA: AAAI Press, pp. 477-484.
  84. Schaffer, C. 1993. Overfitting avoidance as bias. Machine Learning, 10:153-178.
  85. Schaffer, C. 1994. A conservation law for generalization performance. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 259-265.
  86. Schapire, R.E., Freund, Y., Bartlett, P., and Lee, W.S. 1997. Boosting the margin: A new explanation for the effectiveness of voting methods. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann.
  87. Schölkopf, B., Burges, C., and Smola, A. 1998. Advances in Kernel Methods: Support Vector Machines. Cambridge, MA: MIT Press.
  88. Schölkopf, B., Burges, C., and Vapnik, V. 1995. Extracting support data for a given task. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montréal, Canada: AAAI Press, pp. 252-257.
  89. Schuurmans, D. 1997. A new metric-based approach to model selection. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press, pp. 552-558.
  90. Schuurmans, D., Ungar, L.H., and Foster, D.P. 1997. Characterizing the generalization performance of model selection strategies. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 340-348.
  91. Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics, 6:461-464.
  92. Shawe-Taylor, J., Bartlett, P.L., Williamson, R.C., and Anthony, M. 1996. Structural risk minimization over data-dependent hierarchies. Technical Report No. NC-TR-96-053, Egham, UK: Department of Computer Science, Royal Holloway, University of London.
  93. Shen, W.-M., Ong, K., Mitbander, B., and Zaniolo, C. 1996. Metaqueries for data mining. In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.). Menlo Park, CA: AAAI Press, pp. 375-398.
  94. Smola, A., Bartlett, P., Schölkopf, B., and Schuurmans, D. (Eds.). 1998. Proceedings of the NIPS-98 Workshop on Large Margin Classifiers, Breckenridge, CO: NIPS Foundation.
  95. Srikant, R., Vu, Q., and Agrawal, R. 1997. Mining association rules with item constraints. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 67-73.
  96. Todorovski, L. and Džeroski, S. 1997. Declarative bias in equation discovery. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 376-384.
  97. Tornay, S.C. 1938. Ockham: Studies and Selections. La Salle, IL: Open Court.
  98. Vapnik, V.N. 1995. The Nature of Statistical Learning Theory. New York, NY: Springer-Verlag.
  99. Wallace, C.S. and Boulton, D.M. 1968. An information measure for classification. Computer Journal, 11:185-194.
  100. Webb, G.I. 1996. Further experimental evidence against the utility of Occam's razor. Journal of Artificial Intelligence Research, 4:397-417.
  101. Webb, G.I. 1997. Decision tree grafting. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, Nagoya, Japan: Morgan Kaufmann, pp. 846-851.
  102. Wolpert, D. 1992. Stacked generalization. Neural Networks, 5:241-259.
  103. Wolpert, D. 1996. The lack of a priori distinctions between learning algorithms. Neural Computation, 8:1341-1390.

Copyright information

© Kluwer Academic Publishers 1999

Authors and Affiliations

  • Pedro Domingos
  1. Department of Computer Science and Engineering, University of Washington, Seattle
