
The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining

Chapter in Soft Computing for Knowledge Discovery and Data Mining

Many classification studies conclude with a summary table that presents the performance of various data mining approaches on different datasets. No single method outperforms all others all the time. Furthermore, the performance of a classification method in terms of its false-positive and false-negative rates may be totally unpredictable, and attempts to minimize either of these two rates may increase the other. If the model allows new data to be deemed unclassifiable when there is not adequate information to classify them, then the previous two error rates may be very low while, at the same time, the rate of unclassifiable new examples is very high. The root of this critical problem lies in the overfitting and overgeneralization behaviors of a given classification approach when it processes a particular dataset. Although this situation is of fundamental importance to data mining, it has not been studied from a comprehensive point of view. Thus, this chapter analyzes the above issues in depth. It also proposes a new approach, called the Homogeneity-Based Algorithm (HBA), for optimally controlling the previous three error rates; this is done by first formulating an optimization problem. The key development in this chapter is a special way of analyzing the space of the training data and then partitioning it according to the data density of its different regions. The classification task is then pursued based on this partitioning of the training space. In this way, the previous three error rates can be controlled in a comprehensive manner. Some preliminary computational results seem to indicate that the proposed approach has significant potential to fill a critical gap in current data mining methodologies.

Key words: classification, prediction, overfitting, overgeneralization, false-positive, false-negative, homogeneous set, homogeneity degree, optimization
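
As a rough illustration of the density-based partitioning idea described in the abstract, the sketch below partitions a toy training set into class regions, scores each region with a crude k-nearest-neighbor density proxy, and refuses to classify points that no sufficiently dense region covers (the "unclassifiable" outcome counted by the chapter's third error rate). This is a minimal sketch only: the names (knn_density, classify, the threshold beta), the bounding-box regions, and the density estimate itself are hypothetical stand-ins, not the authors' actual HBA.

    # Hypothetical sketch in the spirit of the chapter's Homogeneity-Based
    # Algorithm (HBA); NOT the authors' implementation.
    import numpy as np

    def knn_density(points, k=3):
        # Crude density proxy: inverse of the mean distance to the k nearest
        # neighbors of each point (larger value = denser neighborhood).
        dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
        dists.sort(axis=1)                  # row-wise; column 0 is the self-distance 0
        mean_knn = dists[:, 1:k + 1].mean(axis=1)
        return 1.0 / (mean_knn + 1e-12)

    def classify(x, regions, beta=1.0):
        # Assign x to the class of the densest region containing it; return
        # None (unclassifiable) when no sufficiently dense region covers x.
        best = None
        for lo, hi, label, density in regions:
            if density >= beta and np.all(x >= lo) and np.all(x <= hi):
                if best is None or density > best[1]:
                    best = (label, density)
        return best[0] if best is not None else None

    # Toy training data: two well-separated classes in the plane.
    rng = np.random.default_rng(0)
    pos = rng.normal(loc=0.0, scale=0.5, size=(30, 2))
    neg = rng.normal(loc=3.0, scale=0.5, size=(30, 2))

    # Summarize each class by one bounding-box region with a density score.
    regions = []
    for pts, label in [(pos, "+"), (neg, "-")]:
        regions.append((pts.min(axis=0), pts.max(axis=0), label,
                        knn_density(pts).mean()))

    print(classify(np.array([0.1, -0.2]), regions))   # inside the "+" box: "+"
    print(classify(np.array([10.0, 10.0]), regions))  # far from both: None

Tightening beta trades lower false-positive and false-negative rates for more unclassifiable points; balancing these three rates is exactly the trade-off that the chapter's optimization formulation is designed to control.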





Copyright information

© 2008 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Pham, H.N.A., Triantaphyllou, E. (2008). The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining. In: Maimon, O., Rokach, L. (eds) Soft Computing for Knowledge Discovery and Data Mining. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-69935-6_16

  • DOI: https://doi.org/10.1007/978-0-387-69935-6_16

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-69934-9

  • Online ISBN: 978-0-387-69935-6

  • eBook Packages: Computer Science, Computer Science (R0)
