The Role of Occam's Razor in Knowledge Discovery

Abstract

Many KDD systems incorporate an implicit or explicit preference for simpler models, but this use of “Occam's razor” has been strongly criticized by several authors (e.g., Schaffer, 1993; Webb, 1996). This controversy arises partly because Occam's razor has been interpreted in two quite different ways. The first interpretation (simplicity is a goal in itself) is essentially correct, but is at heart a preference for more comprehensible models. The second interpretation (simplicity leads to greater accuracy) is much more problematic. A critical review of the theoretical arguments for and against it shows that it is unfounded as a universal principle, and demonstrably false. A review of empirical evidence shows that it also fails as a practical heuristic. This article argues that its continued use in KDD risks causing significant opportunities to be missed, and should therefore be restricted to the comparatively few applications where it is appropriate. The article proposes and reviews the use of domain constraints as an alternative for avoiding overfitting, and examines possible methods for handling the accuracy–comprehensibility trade-off.
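
The second interpretation is an empirical claim, and it can be checked directly. The sketch below is illustrative rather than taken from the article: it assumes Python with scikit-learn installed, and the dataset choice is arbitrary. It compares a one-level decision tree (a decision stump, the kind of "very simple" classifier studied by Holte, 1993) against an unrestricted tree by cross-validated accuracy; which one wins varies from dataset to dataset, which is the pattern that undermines the razor as a universal principle.

```python
# Illustrative sketch only (not from the article): does restricting model
# complexity improve out-of-sample accuracy on this particular dataset?
# Assumes scikit-learn is installed; the dataset choice is arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# "Simple" model: a one-level tree (a decision stump, as in Holte, 1993).
stump = DecisionTreeClassifier(max_depth=1, random_state=0)

# "Complex" model: a tree grown to purity, with no depth limit or pruning.
full = DecisionTreeClassifier(max_depth=None, random_state=0)

for name, model in [("depth-1 stump", stump), ("unpruned tree", full)]:
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```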

References

  1. Abu-Mostafa, Y.S. 1989. Learning from hints in neural networks. Journal of Complexity, 6:192-198.

  2. Akaike, H. 1978. A Bayesian analysis of the minimum AIC procedure. Annals of the Institute of Statistical Mathematics, 30A:9-14.

  3. Andrews, R. and Diederich, J. (Eds.). 1996. Proceedings of the NIPS-96 Workshop on Rule Extraction from Trained Artificial Neural Networks, Snowmass, CO: NIPS Foundation.

  4. Bernardo, J.M. and Smith, A.F.M. 1994. Bayesian Theory. New York, NY: Wiley.

  5. Bishop, C.M. 1995. Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press.

  6. Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M.K. 1987. Occam's razor. Information Processing Letters, 24:377-380.

  7. Breiman, L. 1996. Bagging predictors. Machine Learning, 24:123-140.

  8. Breiman, L. and Shang, N. 1997. Born again trees. Technical Report, Berkeley, CA: Statistics Department, University of California at Berkeley.

  9. Brunk, C., Kelly, J., and Kohavi, R. 1997. MineSet: An integrated system for data mining. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 135-138.

  10. Cestnik, B. and Bratko, I. 1988. Learning redundant rules in noisy domains. Proceedings of the Eighth European Conference on Artificial Intelligence, Munich, Germany: Pitman, pp. 348-356.

  11. Cheeseman, P. 1990. On finding the most probable model. In Computational Models of Scientific Discovery and Theory Formation, J. Shrager and P. Langley (Eds.). San Mateo, CA: Morgan Kaufmann, pp. 73-95.

  12. Chickering, D.M. and Heckerman, D. 1997. Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29:181-212.

  13. Clark, P. and Matwin, S. 1993. Using qualitative models to guide inductive learning. Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA: Morgan Kaufmann, pp. 49-56.

  14. Clearwater, S. and Provost, F. 1990. RL4: A tool for knowledge-based induction. Proceedings of the Second IEEE International Conference on Tools for Artificial Intelligence, San Jose, CA: IEEE Computer Society Press, pp. 24-30.

  15. Cohen, W.W. 1994. Grammatically biased learning: Learning logic programs using an explicit antecedent description language. Artificial Intelligence, 68:303-366.

  16. Cohen, W.W. 1995. Fast effective rule induction. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 115-123.

  17. Cooper, G.F. 1997. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 1:203-224.

  18. Cover, T.M. and Thomas, J.A. 1991. Elements of Information Theory. New York, NY: Wiley.

  19. Craven, M.W. 1996. Extracting comprehensible models from trained neural networks. Unpublished doctoral dissertation, Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI.

  20. Datta, P. and Kibler, D. 1995. Learning prototypical concept descriptions. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 158-166.

  21. Djoko, S., Cook, D.J., and Holder, L.B. 1995. Analyzing the benefits of domain knowledge in substructure discovery. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montréal, Canada: AAAI Press, pp. 75-80.

  22. Domingos, P. 1996a. Two-way induction. International Journal on Artificial Intelligence Tools, 5:113-125.

  23. Domingos, P. 1996b. Unifying instance-based and rule-based induction. Machine Learning, 24:141-168.

  24. Domingos, P. 1997a. Knowledge acquisition from examples via multiple models. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 98-106.

  25. Domingos, P. 1997b. Why does bagging work? A Bayesian account and its implications. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 155-158.

  26. Domingos, P. 1998a. A process-oriented heuristic for model selection. Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI: Morgan Kaufmann, pp. 127-135.

  27. Domingos, P. 1998b. When (and how) to combine predictive and causal learning. Proceedings of the NIPS-98 Workshop on Integrating Supervised and Unsupervised Learning, Breckenridge, CO: NIPS Foundation.

  28. Domingos, P. 1999. Process-oriented estimation of generalization error. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden: Morgan Kaufmann.

  29. Domingos, P. and Pazzani, M. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103-130.

  30. Donoho, S. and Rendell, L. 1996. Constructive induction using fragmentary knowledge. Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy: Morgan Kaufmann, pp. 113-121.

  31. Drucker, H., Cortes, C., Jackel, L.D., LeCun, Y., and Vapnik, V. 1994. Boosting and other machine learning algorithms. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 53-61.

  32. Edgington, E.S. 1980. Randomization Tests. New York, NY: Marcel Dekker.

  33. Elomaa, T. 1994. In defense of C4.5: Notes on learning one-level decision trees. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 62-69.

  34. Fisher, D.H. and Schlimmer, J.C. 1988. Concept simplification and prediction accuracy. Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, MI: Morgan Kaufmann, pp. 22-28.

  35. Freund, Y. and Schapire, R.E. 1996. Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy: Morgan Kaufmann, pp. 148-156.

  36. Friedman, J.H. 1997. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1:55-77.

  37. Gams, M. 1989. New measurements highlight the importance of redundant knowledge. Proceedings of the Fourth European Working Session on Learning, Montpellier, France: Pitman, pp. 71-79.

  38. Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58.

  39. Grove, A.J. and Schuurmans, D. 1998. Boosting in the limit: Maximizing the margin of learned ensembles. Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, WI: AAAI Press, pp. 692-699.

  40. Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W., Koperski, K., Li, D., Lu, Y., Rajan, A., Stefanovic, N., Xia, B., and Zaiane, O. 1996. DBMiner: A system for mining knowledge in large relational databases. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 250-255.

  41. Hasling, D.W., Clancey, W.J., and Rennels, G. 1984. Strategic explanations for a diagnostic consultation system. In Developments in Expert Systems, M.J. Coombs (Ed.). London, UK: Academic Press, pp. 117-133.

  42. Haussler, D. 1988. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177-221.

  43. Heckerman, D., Geiger, D., and Chickering, D.M. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.

  44. Holte, R.C. 1993. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11:63-91.

  45. Imielinski, T., Virmani, A., and Abdulghani, A. 1996. DataMine: Application programming interface and query language for database mining. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 256-262.

  46. Jensen, D. 1992. Induction with Randomization Testing: Decision-Oriented Analysis of Large Data Sets. Unpublished doctoral dissertation, Washington University, Saint Louis, MO.

  47. Jensen, D. and Cohen, P.R. 1999. Multiple comparisons in induction algorithms. Machine Learning, to appear.

  48. Jensen, D. and Schmill, M. 1997. Adjusting for multiple comparisons in decision tree pruning. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 195-198.

  49. Joachims, T. 1998. Text categorization with support vector machines: Learning with many relevant features. Proceedings of the Tenth European Conference on Machine Learning, Chemnitz, Germany: Springer-Verlag.

  50. Kamber, M., Han, J., and Chiang, J.Y. 1997. Metarule-guided mining of multi-dimensional association rules using data cubes. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 207-210.

  51. Kohavi, R. and Kunz, C. 1997. Option decision trees with majority votes. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 161-169.

  52. Kohavi, R. and Sommerfield, D. 1998. Targeting business users with decision table classifiers. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 249-253.

  53. Kong, E.B. and Dietterich, T.G. 1995. Error-correcting output coding corrects bias and variance. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 313-321.

  54. Kononenko, I. 1990. Comparison of inductive and naive Bayesian learning approaches to automatic knowledge acquisition. In Current Trends in Knowledge Acquisition, B. Wielinga (Ed.). Amsterdam, The Netherlands: IOS Press.

  55. Langley, P. 1996. Induction of condensed determinations. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 327-330.

  56. Lawrence, S., Giles, C.L., and Tsoi, A.C. 1997. Lessons in neural network training: Overfitting may be harder than expected. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press, pp. 540-545.

  57. Lee, Y., Buchanan, B.G., and Aronis, J.M. 1998. Knowledge-based learning in exploratory science: Learning rules to predict rodent carcinogenicity. Machine Learning, 30:217-240.

  58. Liu, B., Hsu, W., and Chen, S. 1997. Using general impressions to analyze discovered classification rules. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 31-36.

  59. MacKay, D. 1992. Bayesian interpolation. Neural Computation, 4:415-447.

  60. Maclin, R. and Opitz, D. 1997. An empirical evaluation of bagging and boosting. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press.

  61. Maclin, R. and Shavlik, J. 1996. Creating advice-taking reinforcement learners. Machine Learning, 22:251-281.

  62. Meo, R., Psaila, G., and Ceri, S. 1996. A new SQL-like operator for mining association rules. Proceedings of the Twenty-Second International Conference on Very Large Databases, Bombay, India: Morgan Kaufmann, pp. 122-133.

  63. Miller, R.G., Jr. 1981. Simultaneous Statistical Inference, 2nd ed. New York, NY: Springer-Verlag.

  64. Mingers, J. 1989. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4:227-243.

  65. Mitchell, T.M. 1980. The need for biases in learning generalizations, Technical report, New Brunswick, NJ: Computer Science Department, Rutgers University.

  66. Murphy, P. and Pazzani, M. 1994. Exploring the decision forest: An empirical investigation of Occam's razor in decision tree induction. Journal of Artificial Intelligence Research, 1:257-275.

  67. Murthy, S. and Salzberg, S. 1995. Lookahead and pathology in decision tree induction. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montréal, Canada: Morgan Kaufmann, pp. 1025-1031.

  68. Nédellec, C., Rouveirol, C., Adé, H., Bergadano, F., and Tausend, B. 1996. Declarative bias in ILP. In Advances in Inductive Logic Programming, L. de Raedt (Ed.). Amsterdam, The Netherlands: IOS Press, pp. 82-103.

  69. Oates, T. and Jensen, D. 1998. Large datasets lead to overly complex models: An explanation and a solution. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 294-298.

  70. Ourston, D. and Mooney, R.J. 1994. Theory refinement combining analytical and empirical methods. Artificial Intelligence, 66:273-309.

  71. Padmanabhan, B. and Tuzhilin, A. 1998. A belief-driven method for discovering unexpected patterns. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 94-100.

  72. Pazzani, M., Mani, S., and Shankle, W.R. 1997. Beyond concise and colorful: Learning intelligible rules. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 235-238.

  73. Pazzani, M.J. 1991. Influence of prior knowledge on concept acquisition: Experimental and computational results. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17:416-432.

  74. Pearl, J. 1978. On the connection between the complexity and credibility of inferred models. International Journal of General Systems, 4:255-264.

  75. Piatetsky-Shapiro, G. 1996. Editorial comments. KDD Nuggets, 96:28.

  76. Provost, F. and Jensen, D. 1998. KDD-98 Tutorial on Evaluating Knowledge Discovery and Data Mining. New York, NY: AAAI Press.

  77. Quinlan, J.R. 1996. Bagging, boosting, and C4.5. Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR: AAAI Press, pp. 725-730.

  78. Quinlan, J.R. and Cameron-Jones, R.M. 1995. Oversearching and layered search in empirical learning. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montréal, Canada: Morgan Kaufmann, pp. 1019-1024.

  79. Quinlan, J.R. and Rivest, R.L. 1989. Inferring decision trees using the minimum description length principle. Information and Computation, 80:227-248.

  80. Rao, J.S. and Potts, W.J.E. 1997. Visualizing bagged decision trees. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 243-246.

  81. Rao, R.B., Gordon, D., and Spears, W. 1995. For every action, is there really an equal and opposite reaction? Analysis of the conservation law for generalization performance. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 471-479.

  82. Rissanen, J. 1978. Modeling by shortest data description. Automatica, 14:465-471.

  83. Russell, S.J. 1986. Preliminary steps towards the automation of induction. Proceedings of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA: AAAI Press, pp. 477-484.

  84. Schaffer, C. 1993. Overfitting avoidance as bias. Machine Learning, 10:153-178.

  85. Schaffer, C. 1994. A conservation law for generalization performance. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 259-265.

  86. Schapire, R.E., Freund, Y., Bartlett, P., and Lee, W.S. 1997. Boosting the margin: A new explanation for the effectiveness of voting methods. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann.

  87. Schölkopf, B., Burges, C., and Smola, A. 1998. Advances in Kernel Methods: Support Vector Machines. Cambridge, MA: MIT Press.

  88. Schölkopf, B., Burges, C., and Vapnik, V. 1995. Extracting support data for a given task. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montréal, Canada: AAAI Press, pp. 252-257.

  89. Schuurmans, D. 1997. A new metric-based approach to model selection. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press, pp. 552-558.

  90. Schuurmans, D., Ungar, L.H., and Foster, D.P. 1997. Characterizing the generalization performance of model selection strategies. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 340-348.

  91. Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics, 6:461-464.

  92. Shawe-Taylor, J., Bartlett, P.L., Williamson, R.C., and Anthony, M. 1996. Structural risk minimization over data-dependent hierarchies, Technical report No. NC-TR-96-053, Egham, UK: Department of Computer Science, Royal Holloway, University of London.

  93. Shen, W.-M., Ong, K., Mitbander, B., and Zaniolo, C. 1996. Metaqueries for data mining. In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.). Menlo Park, CA: AAAI Press, pp. 375-398.

  94. Smola, A., Bartlett, P., Schölkopf, B., and Schuurmans, D. (Eds.). 1998. Proceedings of the NIPS-98 Workshop on Large Margin Classifiers, Breckenridge, CO: NIPS Foundation.

  95. Srikant, R., Vu, Q., and Agrawal, R. 1997. Mining association rules with item constraints. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 67-73.

  96. Todorovski, L. and Džeroski, S. 1997. Declarative bias in equation discovery. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 376-384.

  97. Tornay, S.C. 1938. Ockham: Studies and Selections. La Salle, IL: Open Court.

  98. Vapnik, V.N. 1995. The Nature of Statistical Learning Theory. New York, NY: Springer-Verlag.

  99. Wallace, C.S. and Boulton, D.M. 1968. An information measure for classification. Computer Journal, 11:185-194.

  100. Webb, G.I. 1996. Further experimental evidence against the utility of Occam's razor. Journal of Artificial Intelligence Research, 4:397-417.

  101. Webb, G.I. 1997. Decision tree grafting. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, Nagoya, Japan: Morgan Kaufmann, pp. 846-851.

  102. Wolpert, D. 1992. Stacked generalization. Neural Networks, 5:241-259.

  103. Wolpert, D. 1996. The lack of a priori distinctions between learning algorithms. Neural Computation, 8:1341-1390.

Cite this article

Domingos, P. The Role of Occam's Razor in Knowledge Discovery. Data Mining and Knowledge Discovery 3, 409–425 (1999). https://doi.org/10.1023/A:1009868929893

Keywords

  • model selection
  • overfitting
  • multiple comparisons
  • comprehensible models
  • domain knowledge