
Data Mining, pp. 193-226

The Impact of Small Disjuncts on Classifier Learning

  • Gary M. Weiss
Chapter
Part of the Annals of Information Systems book series (AOIS, volume 8)

Abstract

Many classifier induction systems express the induced classifier in terms of a disjunctive description. Small disjuncts are those that classify few training examples. These disjuncts are interesting because they are known to have a much higher error rate than large disjuncts and are responsible for many, if not most, of all classification errors. Previous research has investigated this phenomenon by performing ad hoc analyses of a small number of data sets. In this chapter we provide a much more systematic study of small disjuncts and analyze how they affect classifiers induced from 30 real-world data sets. A new metric, error concentration, is used to show that for these 30 data sets classification errors are often heavily concentrated toward the smaller disjuncts. Various factors, including pruning, training set size, noise, and class imbalance, are then analyzed to determine how they affect small disjuncts and the distribution of errors across disjuncts. This analysis provides many insights into why some data sets are difficult to learn from and also provides a better understanding of classifier learning in general. We believe that such an understanding is critical to the development of improved classifier induction algorithms.
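
The abstract names the error concentration metric without defining it. The sketch below assumes the definition used in the author's earlier work on small disjuncts: disjuncts are ordered from smallest to largest, the cumulative fraction of test errors is plotted against the cumulative fraction of correctly classified test examples, and the metric is the area between that curve and the diagonal, scaled to the range -1 to 1. The function name, the tuple layout, and the handling of equally sized disjuncts (left in input order) are illustrative assumptions, not taken from the chapter.

```python
from typing import List, Tuple

# Each disjunct is (size, n_correct, n_errors):
#   size      - training examples covered by the disjunct (assumed layout)
#   n_correct - test examples it classifies correctly
#   n_errors  - test examples it misclassifies
Disjunct = Tuple[int, int, int]


def error_concentration(disjuncts: List[Disjunct]) -> float:
    """Return a value in [-1, 1]; positive values mean errors
    are concentrated toward the smaller disjuncts."""
    total_correct = sum(c for _, c, _ in disjuncts)
    total_errors = sum(e for _, _, e in disjuncts)
    if total_correct == 0 or total_errors == 0:
        raise ValueError("need at least one correct example and one error")

    # Walk the disjuncts from smallest to largest, building the curve
    # (x = cumulative share of correct examples, y = cumulative share of errors)
    # and accumulating the area under it with the trapezoid rule.
    ordered = sorted(disjuncts, key=lambda d: d[0])
    x_prev = y_prev = 0.0
    area = 0.0
    for _, n_correct, n_errors in ordered:
        x = x_prev + n_correct / total_correct
        y = y_prev + n_errors / total_errors
        area += (x - x_prev) * (y + y_prev) / 2.0
        x_prev, y_prev = x, y

    # The diagonal y = x has area 0.5; rescale so it maps to 0.
    return 2.0 * area - 1.0


if __name__ == "__main__":
    # Hypothetical toy data: one tiny disjunct holds every error.
    print(error_concentration([(1, 0, 5), (100, 95, 0)]))
```

Under this assumed definition, a value near 0 means every disjunct has roughly the same error rate, while a negative value means the errors sit in the larger disjuncts.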

Keywords

Error Rate; Class Distribution; High Error Rate; Minority Class; Class Imbalance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Fordham University, Bronx, USA
