Empirical Software Engineering, Volume 6, Issue 1, pp 59–79

Controlling Overfitting in Classification-Tree Models of Software Quality

  • Taghi M. Khoshgoftaar
  • Edward B. Allen

DOI: 10.1023/A:1009803004576

Cite this article as:
Khoshgoftaar, T.M. & Allen, E.B. Empirical Software Engineering (2001) 6: 59. doi:10.1023/A:1009803004576


Predicting which modules are likely to have faults during operations is important to software developers, so that software enhancement efforts can be focused on those modules that need improvement the most. Modeling software quality with classification trees is attractive because they readily model nonmonotonic relationships. In this paper, we apply the TREEDISC algorithm, which is a refinement of the CHAID algorithm, to build classification-tree models. CHAID-based algorithms differ from other classification-tree algorithms in their reliance on chi-squared tests when building the tree. Classification-tree models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely. Even though a model appears to be accurate on training data, if overfitted, it may be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on expected costs of misclassification, rather than the total number of misclassifications. We conducted a case study of a very large telecommunications system. A two-way analysis of variance with repetitions found that TREEDISC's significance level was highly related to overfitting, and can be used to control it. Moreover, the minimum number of modules in a leaf also influenced the degree of overfitting.
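To illustrate the cost-based view of overfitting described in the abstract, the minimal sketch below computes an expected cost of misclassification per module and compares it between a training set and a later data set. The function names, the example labels, the cost values (1 and 10), and the choice to express overfitting as the difference between the two costs are all assumptions made for illustration; they are not taken from the paper and may differ from the authors' exact formulation.

```python
def expected_cost(actual, predicted, cost_type1, cost_type2):
    """Average misclassification cost per module (hypothetical helper).

    actual, predicted: sequences of booleans, True meaning fault-prone.
    Type I error:  a not fault-prone module classified as fault-prone.
    Type II error: a fault-prone module classified as not fault-prone.
    """
    pairs = list(zip(actual, predicted))
    type1 = sum(1 for a, p in pairs if not a and p)
    type2 = sum(1 for a, p in pairs if a and not p)
    return (cost_type1 * type1 + cost_type2 * type2) / len(pairs)


def overfitting_gap(train, test, cost_type1=1.0, cost_type2=10.0):
    """Cost on the later data set minus cost on the training data.

    train, test: (actual, predicted) tuples. A large positive gap suggests
    the model fits the training data much better than it generalizes.
    The default costs are illustrative assumptions only.
    """
    train_cost = expected_cost(*train, cost_type1, cost_type2)
    test_cost = expected_cost(*test, cost_type1, cost_type2)
    return test_cost - train_cost


# Made-up labels: the model is perfect on training data but makes one
# Type I and one Type II error on the later data set.
train = ([False, False, False, True], [False, False, False, True])
test = ([False, False, True, True], [False, True, False, True])
gap = overfitting_gap(train, test)
```

Weighting Type II errors more heavily than Type I errors reflects the abstract's point that missing a fault-prone module is the more severe mistake, so two models with the same number of misclassifications can differ sharply in cost.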

Empirical studies, software metrics, software quality, fault-prone modules, classification trees, TREEDISC, CHAID

Copyright information

© Kluwer Academic Publishers 2001

Authors and Affiliations

  • Taghi M. Khoshgoftaar (1)
  • Edward B. Allen (2)

  1. Florida Atlantic University, Boca Raton, USA
  2. Mississippi State University, USA