
The Role of Occam's Razor in Knowledge Discovery

Abstract

Many KDD systems incorporate an implicit or explicit preference for simpler models, but this use of “Occam's razor” has been strongly criticized by several authors (e.g., Schaffer, 1993; Webb, 1996). This controversy arises partly because Occam's razor has been interpreted in two quite different ways. The first interpretation (simplicity is a goal in itself) is essentially correct, but is at heart a preference for more comprehensible models. The second interpretation (simplicity leads to greater accuracy) is much more problematic. A critical review of the theoretical arguments for and against it shows that it is unfounded as a universal principle, and demonstrably false. A review of empirical evidence shows that it also fails as a practical heuristic. This article argues that its continued use in KDD risks causing significant opportunities to be missed, and should therefore be restricted to the comparatively few applications where it is appropriate. The article proposes and reviews the use of domain constraints as an alternative for avoiding overfitting, and examines possible methods for handling the accuracy–comprehensibility trade-off.


References

  • Abu-Mostafa, Y.S. 1989. Learning from hints in neural networks. Journal of Complexity, 6:192-198.

  • Akaike, H. 1978. A Bayesian analysis of the minimum AIC procedure. Annals of the Institute of Statistical Mathematics, 30A:9-14.

  • Andrews, R. and Diederich, J. (Eds.). 1996. Proceedings of the NIPS-96 Workshop on Rule Extraction from Trained Artificial Neural Networks, Snowmass, CO: NIPS Foundation.

  • Bernardo, J.M. and Smith, A.F.M. 1994. Bayesian Theory. New York, NY: Wiley.

  • Bishop, C.M. 1995. Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press.

  • Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M.K. 1987. Occam's razor. Information Processing Letters, 24:377-380.

  • Breiman, L. 1996. Bagging predictors. Machine Learning, 24:123-140.

  • Breiman, L. and Shang, N. 1997. Born again trees. Technical Report, Berkeley, CA: Statistics Department, University of California at Berkeley.

  • Brunk, C., Kelly, J., and Kohavi, R. 1997. MineSet: An integrated system for data mining. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 135-138.

  • Cestnik, B. and Bratko, I. 1988. Learning redundant rules in noisy domains. Proceedings of the Eighth European Conference on Artificial Intelligence, Munich, Germany: Pitman, pp. 348-356.

  • Cheeseman, P. 1990. On finding the most probable model. In Computational Models of Scientific Discovery and Theory Formation, J. Shrager and P. Langley (Eds.). San Mateo, CA: Morgan Kaufmann, pp. 73-95.

  • Chickering, D.M. and Heckerman, D. 1997. Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29:181-212.

  • Clark, P. and Matwin, S. 1993. Using qualitative models to guide inductive learning. Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA: Morgan Kaufmann, pp. 49-56.

  • Clearwater, S. and Provost, F. 1990. RL4: A tool for knowledge-based induction. Proceedings of the Second IEEE International Conference on Tools for Artificial Intelligence, San Jose, CA: IEEE Computer Society Press, pp. 24-30.

  • Cohen, W.W. 1994. Grammatically biased learning: Learning logic programs using an explicit antecedent description language. Artificial Intelligence, 68:303-366.

  • Cohen, W.W. 1995. Fast effective rule induction. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 115-123.

  • Cooper, G.F. 1997. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 1:203-224.

  • Cover, T.M. and Thomas, J.A. 1991. Elements of Information Theory. New York, NY: Wiley.

  • Craven, M.W. 1996. Extracting comprehensible models from trained neural networks. Unpublished doctoral dissertation, Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI.

  • Datta, P. and Kibler, D. 1995. Learning prototypical concept descriptions. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 158-166.

  • Djoko, S., Cook, D.J., and Holder, L.B. 1995. Analyzing the benefits of domain knowledge in substructure discovery. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montréal, Canada: AAAI Press, pp. 75-80.

  • Domingos, P. 1996a. Two-way induction. International Journal on Artificial Intelligence Tools, 5:113-125.

  • Domingos, P. 1996b. Unifying instance-based and rule-based induction. Machine Learning, 24:141-168.

  • Domingos, P. 1997a. Knowledge acquisition from examples via multiple models. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 98-106.

  • Domingos, P. 1997b. Why does bagging work? A Bayesian account and its implications. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 155-158.

  • Domingos, P. 1998a. A process-oriented heuristic for model selection. Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI: Morgan Kaufmann, pp. 127-135.

  • Domingos, P. 1998b. When (and how) to combine predictive and causal learning. Proceedings of the NIPS-98 Workshop on Integrating Supervised and Unsupervised Learning, Breckenridge, CO: NIPS Foundation.

  • Domingos, P. 1999. Process-oriented estimation of generalization error. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden: Morgan Kaufmann.

  • Domingos, P. and Pazzani, M. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103-130.

  • Donoho, S. and Rendell, L. 1996. Constructive induction using fragmentary knowledge. Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy: Morgan Kaufmann, pp. 113-121.

  • Drucker, H., Cortes, C., Jackel, L.D., LeCun, Y., and Vapnik, V. 1994. Boosting and other machine learning algorithms. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 53-61.

  • Edgington, E.S. 1980. Randomization Tests. New York, NY: Marcel Dekker.

  • Elomaa, T. 1994. In defense of C4.5: Notes on learning one-level decision trees. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 62-69.

  • Fisher, D.H. and Schlimmer, J.C. 1988. Concept simplification and prediction accuracy. Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, MI: Morgan Kaufmann, pp. 22-28.

  • Freund, Y. and Schapire, R.E. 1996. Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy: Morgan Kaufmann, pp. 148-156.

  • Friedman, J.H. 1997. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1:55-77.

  • Gams, M. 1989. New measurements highlight the importance of redundant knowledge. Proceedings of the Fourth European Working Session on Learning, Montpellier, France: Pitman, pp. 71-79.

  • Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58.

  • Grove, A.J. and Schuurmans, D. 1998. Boosting in the limit: Maximizing the margin of learned ensembles. Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, WI: AAAI Press, pp. 692-699.

  • Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W., Koperski, K., Li, D., Lu, Y., Rajan, A., Stefanovic, N., Xia, B., and Zaiane, O. 1996. DBMiner: A system for mining knowledge in large relational databases. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 250-255.

  • Hasling, D.W., Clancey, W.J., and Rennels, G. 1984. Strategic explanations for a diagnostic consultation system. Developments in Expert Systems, M.J. Coombs (Ed.), London, UK: Academic Press, pp. 117-133.

  • Haussler, D. 1988. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177-221.

  • Heckerman, D., Geiger, D., and Chickering, D.M. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.

  • Holte, R.C. 1993. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11:63-91.

  • Imielinski, T., Virmani, A., and Abdulghani, A. 1996. DataMine: Application programming interface and query language for database mining. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 256-262.

  • Jensen, D. 1992. Induction with Randomization Testing: Decision-Oriented Analysis of Large Data Sets. Unpublished doctoral dissertation, Washington University, Saint Louis, MO.

  • Jensen, D. and Cohen, P.R. 1999. Multiple comparisons in induction algorithms. Machine Learning, to appear.

  • Jensen, D. and Schmill, M. 1997. Adjusting for multiple comparisons in decision tree pruning. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 195-198.

  • Joachims, T. 1998. Text categorization with support vector machines: Learning with many relevant features. Proceedings of the Tenth European Conference on Machine Learning, Chemnitz, Germany: Springer-Verlag.

  • Kamber, M., Han, J., and Chiang, J.Y. 1997. Metarule-guided mining of multi-dimensional association rules using data cubes. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 207-210.

  • Kohavi, R. and Kunz, C. 1997. Option decision trees with majority votes. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 161-169.

  • Kohavi, R. and Sommerfield, D. 1998. Targeting business users with decision table classifiers. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 249-253.

  • Kong, E.B. and Dietterich, T.G. 1995. Error-correcting output coding corrects bias and variance. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 313-321.

  • Kononenko, I. 1990. Comparison of inductive and naive Bayesian learning approaches to automatic knowledge acquisition. In Current Trends in Knowledge Acquisition, B. Wielinga (Ed.). Amsterdam, The Netherlands: IOS Press.

  • Langley, P. 1996. Induction of condensed determinations. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 327-330.

  • Lawrence, S., Giles, C.L., and Tsoi, A.C. 1997. Lessons in neural network training: Overfitting may be harder than expected. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press, pp. 540-545.

  • Lee, Y., Buchanan, B.G., and Aronis, J.M. 1998. Knowledge-based learning in exploratory science: Learning rules to predict rodent carcinogenicity. Machine Learning, 30:217-240.

  • Liu, B., Hsu, W., and Chen, S. 1997. Using general impressions to analyze discovered classification rules. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 31-36.

  • MacKay, D. 1992. Bayesian interpolation. Neural Computation, 4:415-447.

  • Maclin, R. and Opitz, D. 1997. An empirical evaluation of bagging and boosting. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press.

  • Maclin, R. and Shavlik, J. 1996. Creating advice-taking reinforcement learners. Machine Learning, 22:251-281.

  • Meo, R., Psaila, G., and Ceri, S. 1996. A new SQL-like operator for mining association rules. Proceedings of the Twenty-Second International Conference on Very Large Databases, Bombay, India: Morgan Kaufmann, pp. 122-133.

  • Miller, Jr., R.G. 1981. Simultaneous Statistical Inference, 2nd ed. New York, NY: Springer-Verlag.

  • Mingers, J. 1989. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4:227-243.

  • Mitchell, T.M. 1980. The need for biases in learning generalizations, Technical report, New Brunswick, NJ: Computer Science Department, Rutgers University.

  • Murphy, P. and Pazzani, M. 1994. Exploring the decision forest: An empirical investigation of Occam's razor in decision tree induction. Journal of Artificial Intelligence Research, 1:257-275.

  • Murthy, S. and Salzberg, S. 1995. Lookahead and pathology in decision tree induction. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montréal, Canada: Morgan Kaufmann, pp. 1025-1031.

  • Nédellec, C., Rouveirol, C., Adé, H., Bergadano, F., and Tausend, B. 1996. Declarative bias in ILP. In Advances in Inductive Logic Programming, L. de Raedt (Ed.). Amsterdam, The Netherlands: IOS Press, pp. 82-103.

  • Oates, T. and Jensen, D. 1998. Large datasets lead to overly complex models: An explanation and a solution. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 294-298.

  • Ourston, D. and Mooney, R.J. 1994. Theory refinement combining analytical and empirical methods. Artificial Intelligence, 66:273-309.

  • Padmanabhan, B. and Tuzhilin, A. 1998. A belief-driven method for discovering unexpected patterns. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 94-100.

  • Pazzani, M., Mani, S., and Shankle, W.R. 1997. Beyond concise and colorful: Learning intelligible rules. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 235-238.

  • Pazzani, M.J. 1991. Influence of prior knowledge on concept acquisition: Experimental and computational results. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17:416-432.

  • Pearl, J. 1978. On the connection between the complexity and credibility of inferred models. International Journal of General Systems, 4:255-264.

  • Piatetsky-Shapiro, G. 1996. Editorial comments. KDD Nuggets, 96:28.

  • Provost, F. and Jensen, D. 1998. KDD-98 Tutorial on Evaluating Knowledge Discovery and Data Mining. New York, NY: AAAI Press.

  • Quinlan, J.R. 1996. Bagging, boosting, and C4.5. Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR: AAAI Press, pp. 725-730.

  • Quinlan, J.R. and Cameron-Jones, R.M. 1995. Oversearching and layered search in empirical learning. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montréal, Canada: Morgan Kaufmann, pp. 1019-1024.

  • Quinlan, J.R. and Rivest, R.L. 1989. Inferring decision trees using the minimum description length principle. Information and Computation, 80:227-248.

  • Rao, J.S. and Potts, W.J.E. 1997. Visualizing bagged decision trees. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 243-246.

  • Rao, R.B., Gordon, D., and Spears, W. 1995. For every action, is there really an equal and opposite reaction? Analysis of the conservation law for generalization performance. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 471-479.

  • Rissanen, J. 1978. Modeling by shortest data description. Automatica, 14:465-471.

  • Russell, S.J. 1986. Preliminary steps towards the automation of induction. Proceedings of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA: AAAI Press, pp. 477-484.

  • Schaffer, C. 1993. Overfitting avoidance as bias. Machine Learning, 10:153-178.

  • Schaffer, C. 1994. A conservation law for generalization performance. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 259-265.

  • Schapire, R.E., Freund, Y., Bartlett, P., and Lee, W.S. 1997. Boosting the margin: A new explanation for the effectiveness of voting methods. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann.

  • Schölkopf, B., Burges, C., and Smola, A. 1998. Advances in Kernel Methods: Support Vector Machines. Cambridge, MA: MIT Press.

  • Schölkopf, B., Burges, C., and Vapnik, V. 1995. Extracting support data for a given task. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montréal, Canada: AAAI Press, pp. 252-257.

  • Schuurmans, D. 1997. A new metric-based approach to model selection. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press, pp. 552-558.

  • Schuurmans, D., Ungar, L.H., and Foster, D.P. 1997. Characterizing the generalization performance of model selection strategies. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 340-348.

  • Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics, 6:461-464.

  • Shawe-Taylor, J., Bartlett, P.L., Williamson, R.C., and Anthony, M. 1996. Structural risk minimization over data-dependent hierarchies, Technical report No. NC-TR-96-053, Egham, UK: Department of Computer Science, Royal Holloway, University of London.

  • Shen, W.-M., Ong, K., Mitbander, B., and Zaniolo, C. 1996. Metaqueries for data mining. In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.). Menlo Park, CA: AAAI Press, pp. 375-398.

  • Smola, A., Bartlett, P., Schölkopf, B., and Schuurmans, D. (Eds.). 1998. Proceedings of the NIPS-98 Workshop on Large Margin Classifiers, Breckenridge, CO: NIPS Foundation.

  • Srikant, R., Vu, Q., and Agrawal, R. 1997. Mining association rules with item constraints. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 67-73.

  • Todorovski, L. and Džeroski, S. 1997. Declarative bias in equation discovery. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 376-384.

  • Tornay, S.C. 1938. Ockham: Studies and Selections. La Salle, IL: Open Court.

  • Vapnik, V.N. 1995. The Nature of Statistical Learning Theory. New York, NY: Springer-Verlag.

  • Wallace, C.S. and Boulton, D.M. 1968. An information measure for classification. Computer Journal, 11:185-194.

  • Webb, G.I. 1996. Further experimental evidence against the utility of Occam's razor. Journal of Artificial Intelligence Research, 4:397-417.

  • Webb, G.I. 1997. Decision tree grafting. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, Nagoya, Japan: Morgan Kaufmann, pp. 846-851.

  • Wolpert, D. 1992. Stacked generalization. Neural Networks, 5:241-259.

  • Wolpert, D. 1996. The lack of a priori distinctions between learning algorithms. Neural Computation, 8:1341-1390.


About this article


Domingos, P. The Role of Occam's Razor in Knowledge Discovery. Data Mining and Knowledge Discovery 3, 409–425 (1999). https://doi.org/10.1023/A:1009868929893


Keywords

  • model selection
  • overfitting
  • multiple comparisons
  • comprehensible models
  • domain knowledge