Machine Learning, Volume 38, Issue 3, pp 309–338

Multiple Comparisons in Induction Algorithms

  • David D. Jensen
  • Paul R. Cohen

Abstract

A single mechanism is responsible for three pathologies of induction algorithms: attribute selection errors, overfitting, and oversearching. In each pathology, induction algorithms compare multiple items based on scores from an evaluation function and select the item with the maximum score. We call this a multiple comparison procedure (MCP). We analyze the statistical properties of MCPs and show how failure to adjust for these properties leads to the pathologies. We also discuss approaches that can control pathological behavior, including Bonferroni adjustment, randomization testing, and cross-validation.

Keywords: inductive learning, overfitting, oversearching, attribute selection, hypothesis testing, parameter estimation
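
The mechanism described in the abstract can be illustrated with a small simulation. The following sketch (not from the paper; the sample size, attribute counts, and scoring function are illustrative assumptions) scores k irrelevant binary attributes against a randomly assigned class label, keeps the attribute with the maximum score as an induction algorithm would, and measures how often that winner passes an unadjusted significance test versus one using a Bonferroni-adjusted threshold of alpha/k, one of the corrections the authors discuss.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100             # examples per synthetic data set
    alpha = 0.05        # nominal false-positive rate for a single comparison
    null_draws = 20000  # draws used to estimate the single-comparison null
    trials = 2000       # Monte Carlo repetitions per value of k

    def max_score(k):
        """Score k irrelevant binary attributes against a random class label
        and return the largest absolute sample correlation (the MCP's winner)."""
        y = rng.integers(0, 2, n).astype(float)
        X = rng.integers(0, 2, (n, k)).astype(float)
        Xc, yc = X - X.mean(axis=0), y - y.mean()
        denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        return np.max(np.abs(Xc.T @ yc) / denom)

    # Null distribution of the score when only ONE attribute is examined.
    single = np.array([max_score(1) for _ in range(null_draws)])

    for k in (1, 10, 100):
        winners = np.array([max_score(k) for _ in range(trials)])
        # Unadjusted test: compare the winner against the single-comparison
        # critical value, ignoring that k comparisons were made.
        naive = np.mean(winners > np.quantile(single, 1 - alpha))
        # Bonferroni adjustment: require significance at alpha / k instead.
        # (The extreme quantile for large k is only roughly estimated here.)
        bonferroni = np.mean(winners > np.quantile(single, 1 - alpha / k))
        print(f"k={k:3d}  unadjusted={naive:.2f}  bonferroni={bonferroni:.2f}")

With this setup the unadjusted rate should climb well above the nominal 0.05 as k grows (roughly 1 - 0.95^k for independent noise attributes), while the Bonferroni-adjusted rate should stay near or below 0.05. This selection effect is the common mechanism behind the three pathologies discussed in the abstract.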

Copyright information

© Kluwer Academic Publishers 2000

Authors and Affiliations

  • David D. Jensen, Experimental Knowledge Systems Laboratory, Department of Computer Science, University of Massachusetts, Amherst, USA
  • Paul R. Cohen, Experimental Knowledge Systems Laboratory, Department of Computer Science, University of Massachusetts, Amherst, USA
