1 Introduction

Accurate estimation of which modules are faulty in a software system can be very useful to software practitioners and researchers. Practitioners can efficiently allocate scarce resources if they can predict which modules may need to undergo more extensive Verification and Validation than others. Researchers need to use quantitative, accurate module defect prediction techniques so they can assess and subsequently improve software development methods. In this paper, by the term “module,” we denote any piece of software (e.g., routine, method, class, package, subsystem, system).

Several techniques have been proposed and applied in the literature for estimating whether a module is faulty (Beecham et al. 2010a, b; Hall et al. 2012; Malhotra 2015; Radjenović et al. 2013). We focus on those techniques that define defect prediction models (i.e., binary classifiers (Fawcett 2006)) by setting a threshold t on a defect proneness model (Huang et al. 2019), i.e., a scoring classifier that uses a set of independent variables. For instance, if the defect proneness model computes the probability that a module is faulty, a defect prediction model estimates a module faulty if its probability of being faulty is above or equal to t. The issue of defining the value of t has been addressed by several approaches in the literature (for instance, Alves et al. (2010), Erni and Lewerentz (1996), Morasca and Lavazza (2017), Schneidewind (2001), Shatnawi (2010), and Tosun and Bener (2009)).

The selection of t may greatly influence the estimates and the performance of the resulting defect prediction model. Thus, to evaluate a defect proneness model, one should evaluate the performance of the entire set of defect prediction models obtained with all possible values of t. Receiver Operating Characteristic (ROC) curves (see Section 4) have long been used to this end.

However, it is unlikely that all possible threshold values are used in practice. Suppose you have a defect proneness model for critical applications. It is unlikely that any sensible stakeholder selects a value of t corresponding to a high risk (e.g., to a 0.8 probability) of releasing a faulty module. Also, practitioners may not be able to confidently choose a “sharp” t value, corresponding to an exact probability value like 0.1. Instead, they may have enough information to know that the value of t should correspond to a risk level around 0.1, e.g., between 0.05 and 0.15. So, the evaluation should be restricted only to those defect prediction models that may be really used, depending on the goals and needs of individual practitioners.

In addition, some values of t are not useful for two general reasons that do not depend on any specific practitioner's or researcher's goals and needs and that therefore hold for every evaluation of defect prediction models.

First, suppose that a defect prediction model based on a set of independent variables and built with a given t does not perform better than a basic, reference defect prediction model that does not use any information from independent variables. Then, the defect prediction model should not be used, because one would be better off by using the simpler reference model. In other words, t is not an adequate threshold value for the defect proneness model.

Second, the prediction models obtained as t varies have different performance, but one should use in practice only those that perform well enough, based on some definition of performance and its minimum acceptable level.

So, in general, it may be ineffective and even misleading to evaluate the defect prediction models built with all possible values of t.

The goal of this paper is to propose an approach to assessing a given defect proneness model. We show how to use ROC curves and reference models to identify the defect prediction models that are worth using because they perform well enough for practical use and outperform reference ones according to (1) standard performance metrics and (2) cost. Thus, we identify the values of t for which it is worthwhile to build and use defect prediction models. Our empirical validation shows the extent of the differences in the assessment of defect prediction models between our method and the traditional one.

Our approach helps practitioners compare defect prediction models and select those useful for their goals and needs. It allows researchers to assess software development techniques based only on those defect prediction models that should be used in practice, and not on irrelevant ones that may bias the results and lead to misleading conclusions.

Here are the main contributions of our proposal.

  • We introduce a new performance metric, the Ratio of Relevant Areas (RRA). RRA can take into account only those parts of the ROC curve corresponding to thresholds for which it is worthwhile to build defect prediction models, i.e., the defect prediction models perform well enough according to some specified notion of performance. We also show how RRA can be customized to address the specific needs of different practitioners.

  • We show that the Area Under the Curve (AUC), Gini’s coefficient (G) (Gini 1912), and other proposals are special cases of RRA that, however, account for parts of the ROC curve corresponding to thresholds for which it is not worthwhile to build defect prediction models.

  • We show how cost can be taken into account. We also provide an inequality that should be satisfied by all defect prediction models, regardless of the way they are built and of the specific misclassification costs.

  • We show that choosing a performance metric (like Precision, Recall, etc.) for the assessment of defect prediction models is not simply a theoretical decision, but it equates to choosing a specific cost model.

We would like to clarify upfront that, in this paper, we are not interested in building the best performing models possible. Metrics like AUC, G, and RRA are used to assess existing, given models. We build models simply because we need them to demonstrate how our proposal works. To this end, we use 67 datasets from the SEACRAFT repository (https://zenodo.org/communities/seacraft). These datasets contain data about a number of measures and the faultiness of the modules belonging to real-life projects. At any rate, in the empirical validation illustrated in Section 10, we build defect proneness models that use all of the available measures so as to maximize the use of the available information about the modules’ characteristics and possibly model performance as well.

The remainder of this paper is organized as follows. Section 2 recalls basic concepts of Software Defect Prediction along with the performance metrics that we use. Section 3 introduces the reference policies against which defect prediction models are evaluated. Sections 4 and 5 summarize the fundamental concepts underlying ROC curves and a few relevant issues. We show how to delimit the values of t that should in general be taken into account and we introduce RRA in Section 6. We show how RRA can be used based on several performance metrics in Section 7 and based on cost in Section 8. Section 9 compares RRA to AUC and G. We empirically demonstrate our approach on datasets from real-life applications in Section 10 and also highlight the insights that RRA can provide and traditional metrics cannot. Section 11 illustrates the usefulness of our approach in Software Engineering practice. Threats to the validity of the empirical study are discussed in Section 12. Section 13 discusses related work. The conclusions and an outline for future research are in Section 14. Appendices A–F contain further details on some mathematical aspects of the paper.

2 Software Defect Prediction

The investigation of software defect prediction is carried out by first learning a defect prediction model (i.e., a binary classifier (Fawcett 2006)) on a set of data called the training set, and then evaluating its performance on a set of previously unseen data, called the test set. By borrowing from other disciplines, we use the labels “positive” for “faulty module” and “negative” for “non-faulty module”. We denote by \(\underline {z} = \langle z_{1}, z_{2}, {\ldots } \rangle \) the set of independent variables (i.e., features) used by a defect prediction model. Also, m will denote a module and we write ‘m’ for short, instead of writing “a module” or “a module m”.

We use defect prediction models fn\((\underline {z},t)\) built by setting a threshold t on a defect proneness model fp\((\underline {z})\), so that fn\((\underline {z},t)\) = positive if and only if fp\((\underline {z}) \ge t\), and fn\((\underline {z},t)\) = negative if and only if fp\((\underline {z}) < t\). For instance, we can use a Binary Logistic Regression (BLR) (Hardin and Hilbe 2002; Hosmer et al. 2013) model, which gives the probability that m is positive, to build a defect prediction model by setting a probability threshold t. Different values of t may lead to different classifiers, with different performance.
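As a minimal illustration (ours, not necessarily the exact code used in Section 10), a BLR defect proneness model can be fitted with scikit-learn and turned into a defect prediction model by thresholding its predicted probabilities; the function names and the threshold value 0.1 are illustrative.

```python
# Minimal sketch (ours): a defect proneness model fp(z), here a BLR model fitted with
# scikit-learn, turned into a defect prediction model fn(z, t) by thresholding at t.
# X is the feature matrix and y_true the labels (1 = faulty, 0 = non-faulty).
from sklearn.linear_model import LogisticRegression

def fit_defect_proneness_model(X, y_true):
    """Return fp(z): the estimated probability that each module is faulty."""
    blr = LogisticRegression(max_iter=1000).fit(X, y_true)
    return lambda Z: blr.predict_proba(Z)[:, 1]

def defect_prediction_model(fp, t):
    """Return fn(z, t): estimate 'positive' (faulty) if and only if fp(z) >= t."""
    return lambda Z: (fp(Z) >= t).astype(int)

# Illustrative usage, with a hypothetical threshold t = 0.1:
# fp = fit_defect_proneness_model(X_train, y_train)
# fn = defect_prediction_model(fp, t=0.1)
# y_pred = fn(X_test)
```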

The performance of a defect prediction model can be assessed based on a confusion matrix, whose schema is shown in Table 1.

Table 1 A confusion matrix

Table 2 shows the performance metrics we use in this paper. They include some of the most used ones and provide a comprehensive idea about the performance of a defect prediction model. The first three columns of Table 2 provide the name, the definition formula, and a concise explanation of the purpose of a metric. The other two columns are explained in Section 3.1.

Table 2 Performance metrics for confusion matrices

Specifically, Precision, Recall, and FM (i.e., the F-measure or F-score (van Rijsbergen 1979)) assess performance with respect to the positive class, while Negative Predictive Value (NPV ), Specificity, and a new metric that we call “Negative-F-measure” (NM), which mirrors FM, assess performance with respect to the negative class. Youden’s J (Youden 1950) (also known as “Informedness”), Markedness (Powers 2012), and ϕ (also known as Matthews Correlation Coefficient (Matthews 1975)), which is the geometric mean of J and Markedness, are overall performance metrics.

Other metrics can be used as well. At any rate, using different metrics does not affect the way our approach works.
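For concreteness, the following sketch (ours) computes the metrics of Table 2 from the cells of a confusion matrix; NM is assumed to be the harmonic mean of NPV and Specificity, mirroring FM.

```python
# Sketch (ours): the performance metrics of Table 2, computed from the confusion-matrix cells.
import math

def metrics(TP, FP, FN, TN):
    precision   = TP / (TP + FP)
    recall      = TP / (TP + FN)
    fm          = 2 * TP / (2 * TP + FP + FN)           # F-measure, positive class
    npv         = TN / (TN + FN)
    specificity = TN / (TN + FP)
    nm          = 2 * TN / (2 * TN + FN + FP)           # "Negative F-measure", mirroring FM
    j           = recall + specificity - 1              # Youden's J (Informedness)
    markedness  = precision + npv - 1
    phi         = (TP * TN - FP * FN) / math.sqrt(      # Matthews Correlation Coefficient
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return {"Precision": precision, "Recall": recall, "FM": fm,
            "NPV": npv, "Specificity": specificity, "NM": nm,
            "J": j, "Markedness": markedness, "phi": phi}
```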

3 Defining Reference Performance Values

The performance of the application of a defect prediction model on a dataset can be assessed by comparing the values obtained for one or more metrics against specified reference values, which set minimal performance standards. Classifiers not meeting these minimal standards should be discarded.

We use two methods to set reference values of performance metrics for defect prediction models: (1) methods based on random policies (see Section 3.1) and (2) deterministic ones (see Section 3.2). In both cases, our goal is to define reference values whose computation does not use any information about the individual modules.

3.1 Methods Based on Random Software Defect Prediction Policies

A random software defect prediction policy associates module m with a probability p(m) that m is estimated positive. Thus, a random policy does not define a single defect prediction model, which deterministically labels each m. Rather, we have a set of defect prediction models, each of which is the result of “tossing a set of coins,” each with probability p(m) for each m.

The performance of a random policy can be evaluated based on the values of TP, TN, FP, and FN that can be expected, i.e., via an expected confusion matrix in which each cell contains the expected value of the metric in the cell. These expected confusion matrices have already been introduced in Morasca (2014) for estimating the number of faults in a set of modules. The same metrics of Table 2 can be defined for random policies, by using the cells of the expected confusion matrix, e.g., Precision for a random policy is computed as \(\frac {\mathbb {E}[TP]}{\mathbb {E}[EP]}\), where \(\mathbb {E}[TP]\) and \(\mathbb {E}[EP]\) are the expected values of TP and EP, respectively.

Our goal is to define random policies that use no information about m. Lacking any information about the specific characteristics of m, all modules must be treated alike. Thus, each m must be given the same probability p to be estimated positive, i.e., a uniform random policy must be used.

We use uniform random policies as references (see Sections 6 and 7) against which to evaluate the performance of defect prediction models. A classifier that performs worse than what can be expected of a uniform random policy should not be used. Our proposal for reference policies is along the lines of Langdon et al. (2016), Lavazza and Morasca (2017), and Shepperd and MacDonell (2012), where a random approach is used for defining reference effort estimation models. Also, completely randomized classifiers were used as reference models for defect prediction in the empirical studies of Herbold et al. (2018) and in the application examples of Khoshgoftaar et al. (2001), though the performance of random classifiers was not taken into account in the definition of any performance metric. In what follows, we use “uni” to denote the uniform random policy for some specified value of p.

The expected confusion matrix for random policies depends on p. We use the value of p that sets the hardest challenge for models using variables, i.e., its Maximum Likelihood estimate \(p = \frac {AP}{n}\). This special case of uni, which we call “pop” (as in “proportion of positives”), is one of the reference policies used in the empirical study of Section 10.

Table 3 shows the values of the cells of an expected confusion matrix for uni and pop policies.

Table 3 Expected confusion matrix for uni and pop policies

In what follows, we write \(k = \frac {AN}{AP}\) (following Flach 2003) to summarize properties of the underlying dataset. For instance, instead of \(\frac {AP}{n}\), we write \(\frac {1}{1+k}\).

Note that ϕuni = ϕpop = 0 (and Juni = Jpop = Markednessuni = Markednesspop = 0), so uni and pop are never associated with the defectiveness of modules in any dataset. Thus, uni, and especially pop, can be used as reference policies in the evaluation of defect prediction models.
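As a small numerical illustration (ours), the expected confusion matrix of Table 3 and the vanishing of ϕ under pop can be checked directly; the AP and AN values below are illustrative, chosen so that k = AN/AP = 27/16 as in the berek dataset used in later examples.

```python
# Sketch (ours): expected confusion matrix of a uniform random policy (Table 3)
# and a numerical check that phi_pop = 0.
import math

def expected_confusion_matrix(AP, AN, p):
    """Each module is estimated positive with the same probability p."""
    return {"TP": p * AP, "FP": p * AN, "FN": (1 - p) * AP, "TN": (1 - p) * AN}

AP, AN = 16, 27                                          # illustrative: k = AN/AP = 27/16
c = expected_confusion_matrix(AP, AN, p=AP / (AP + AN))  # pop: p = AP/n
phi = (c["TP"] * c["TN"] - c["FP"] * c["FN"]) / math.sqrt(
    (c["TP"] + c["FP"]) * (c["TP"] + c["FN"]) * (c["TN"] + c["FP"]) * (c["TN"] + c["FN"]))
precision = c["TP"] / (c["TP"] + c["FP"])                # equals AP/n = 1/(1+k)
recall = c["TP"] / (c["TP"] + c["FN"])                   # also equals AP/n = 1/(1+k)
print(round(phi, 12), precision, recall)                 # 0.0  0.372...  0.372...
```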

3.2 Deterministic Methods

We set a deterministic reference value for ϕ, which is the best known of the three overall performance metrics of Table 2, i.e., J, Markedness, and ϕ. The reference value of ϕ for the uni and pop policies is zero, which sets too low a standard against which to compare the ϕ values of defect prediction models. Any model that provides a positive association between a defect prediction model and actual labels of modules, no matter how small, would be considered better than the standard. In our empirical study, we select ϕ = 0.4 as a reference value for ϕ for medium/strong association, as it is halfway between the medium (ϕ = 0.3) and strong (ϕ = 0.5) values indicated in Cohen (1988).

We do not select deterministic reference values for the first six metrics in Table 2 because we can already set reference values based on the pop policy.

4 ROC: Basic Definitions and Properties

A Receiver Operating Characteristic (ROC) curve (Fawcett 2006) is a graphical plot that illustrates the diagnostic ability of a binary classifier fn\((\underline {z}\),t) with a scoring function fp\((\underline {z})\) as its discrimination threshold t varies.

A ROC curve, which we denote as a function ROC(x), plots the values of y = Recall \(= \frac {TP}{AP}\) against the values of x = Fall-out \(= \frac {FP}{AN} = 1 - Specificity\), computed on a test set for all the defect prediction models fn\((\underline {z},t)\) obtained by using all possible threshold values t. Examples of ROC curves are in Fig. 1.

Fig. 1 ROC curves of defect proneness models from the berek dataset

The [0,1] × [0,1] square to which a ROC curve belongs is called the ROC space (Fawcett 2006). Given a dataset, each point (x,y) of the ROC space corresponds to a defect prediction model’s confusion matrix, since the values of x and y allow the direct computation of TP and FP and the indirect computation of TN and FN, since AP and AN are known.

The two variables x and y are related to t in a (non-strictly) monotonically decreasing way. Hence, a ROC curve ROC(x) is a non-strictly monotonically increasing function of x.
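For illustration, here is a sketch (ours) that computes the points of a ROC curve by sweeping the threshold over the scores produced by a defect proneness model on a test set.

```python
# Sketch (ours): ROC points (x = Fall-out, y = Recall) obtained by sweeping the threshold t
# over the scores produced by a defect proneness model on a test set.
import numpy as np

def roc_points(scores, y_true):
    """Return the (Fall-out, Recall) points, one per distinct threshold value."""
    scores, y_true = np.asarray(scores), np.asarray(y_true)
    AP, AN = np.sum(y_true == 1), np.sum(y_true == 0)
    points = [(0.0, 0.0)]                       # t above every score: nothing estimated positive
    for t in sorted(np.unique(scores), reverse=True):
        positive = scores >= t
        TP = np.sum(positive & (y_true == 1))
        FP = np.sum(positive & (y_true == 0))
        points.append((FP / AN, TP / AP))       # both coordinates are non-decreasing in the sweep
    return points                               # the last point, for the minimum score, is (1, 1)
```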

We now introduce the definition of Upper-left Rectangle of a point (x,y) of the ROC space, which will be used in the remainder of the paper.

Definition 1

Upper-left Rectangle (ULR) of a Point. The Upper-left Rectangle ULR(x,y) of a Point (x,y) is the closed rectangle composed of those points (x’,y’) of the ROC space such that \(x^{\prime } \le x \wedge y^{\prime } \ge y\).

An example of ULR is represented by the highlighted rectangle in Fig. 2, which shows ULR(\(\frac {1}{1+k}, \frac {1}{1+k}\)).

Fig. 2 ROC curve with straight lines \(y = Recall_{pop} =\frac {1}{1+k}\) and \(x = Fall{-}out_{pop}=\frac {1}{1+k}\). The highlighted rectangle is ULR\((\frac {1}{1+k},\frac {1}{1+k})\)

For any sensible performance metric, no point of ULR(x,y) is worse than (x,y) itself: each such point corresponds to a defect prediction model with no more false negatives and no more false positives than the defect prediction model corresponding to (x,y). Point (0,1) has no point other than itself in ULR(0,1), so it corresponds to the best classifier, which provides perfect estimation.

ROC curves have long been used in Software Defect Prediction (Arisholm et al. 2007; Beecham et al. 2010b; Catal 2012; Catal and Diri 2009; Singh et al. 2010), typically to have an overall evaluation of the performance of the defect prediction models learned based on \(\underline {z}\) with all possible values of t.

The evaluation of fn\((\underline {z}\),t) for all values of t, i.e., the overall evaluation of fp\((\underline {z})\), is typically carried out by computing the Area Under the Curve.

Definition 2

Area Under the Curve (AUC).

The Area Under the Curve is the area below ROC(x) in the ROC space.

The longer ROC(x) lingers close to the left and top sides of the ROC space, the larger AUC. Since the total area of the ROC space is 1, the closer AUC is to 1, the better. Hosmer et al. (2013) propose the intervals in Table 4 as interpretation guidelines for AUC as a measure of how well fn\((\underline {z}\),t) discriminates between positives and negatives for all values of t.

Table 4 Evaluation of AUC and G

When comparing defect proneness models, \(fp_{1}(\underline {z}_{1})\) (associated with AUC1) is preferred to \(fp_{2}(\underline {z}_{2})\) (associated with AUC2) if and only if AUC1 > AUC2 (Hanley and McNeil 1982).

The Gini coefficient G = 2 AUC − 1 is a related metric also used for the same purposes (Gini 1912). G takes values in the [0,1] range and was defined in such a way as to be applied only to ROC curves that are never below the diagonal y = x. As there is a one-to-one functional correspondence between them, AUC and G provide the same information. Column “G range” in Table 4 shows how Hosmer et al.’s guidelines for AUC (Hosmer et al. 2013) can be rephrased in terms of G.
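For reference, AUC can be computed from a set of ROC points by the trapezoidal rule, and G follows from it directly; the sketch below (ours) takes points such as those produced by the roc_points sketch above.

```python
# Sketch (ours): AUC by the trapezoidal rule over the ROC points, and G = 2*AUC - 1.
def auc_and_gini(points):
    pts = sorted(points)
    auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
    return auc, 2 * auc - 1
```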

Other metrics have been defined in addition to AUC and G. We concisely review some of them in Section 13.1.

5 Evaluation Issues

ROC curves have been widely studied and used in several fields, and a few issues have been pointed out about their definition and evaluation. We concisely review them in Section 13. Here, we focus on two issues about the use of AUC as a sensible way for providing (1) an evaluation of a defect proneness model and (2) a comparison between two defect proneness models.

5.1 Evaluation of a Defect Proneness Model: The Diagonal

The diagonal of the ROC space represents the expected performance of random policies. Table 3 shows that, given a value of p, \(\mathbb {E}[FP] = p \cdot AN\) and \(\mathbb {E}[TP] = p \cdot AP\), so \(\mathbb {E}[x] = \frac {\mathbb {E}[FP]}{AN} = p\) and \(\mathbb {E}[y] = \frac {\mathbb {E}[TP]}{AP} = p\). Thus, \(\mathbb {E}[x] = \mathbb {E}[y] = p\), i.e., the diagonal y = x is the expected ROC curve under a random policy, for each possible value of p. Since the points in ULR(x,y) are not worse than (x,y), the upper-left triangle of the ROC space delimited by the diagonal is also the set of points corresponding to defect prediction models whose performance is not worse than the expected performance of random policies. In practice, the upper-left triangle is the truly interesting part of the ROC space when building useful defect prediction models. It is well-known (Fawcett 2006) that, if a classifier corresponds to a point in the lower-right triangle, a better classifier can be obtained simply by inverting its estimates.

However, AUC is computed by also taking into account the lower-right triangle of the ROC space. Notice that AUC is the area between ROC(x) and the reference ROC curve y = 0, which corresponds to a defect prediction model that estimates TP= 0 for all values of t. In other words, AUC quantifies how different a ROC curve is from this extremely badly performing classifier—even worse than what is expected of any random policy. In practice, instead, AUC is to be compared to random policies, as Table 4 shows.

Using random policies, characterized by y = x, as the reference instead of y = 0 appears more adequate. Instead of the area under the curve in the entire ROC space, one can use the area under the curve in the upper-left triangle, and normalize it by the area of the triangle, i.e., 1/2. This is actually the value of the Gini coefficient G, when ROC(x) is entirely above the diagonal. If that is not the case, one can define a modified ROC curve ROC’(x) that coincides with ROC(x) when ROC(x) is above the diagonal, and coincides with the diagonal otherwise. In practice, this corresponds to using the defect prediction models for those values of t that yield better performance than random policies, and otherwise falling back on random policies, which are an inexpensive backup estimation technique that is always available.
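A sketch (ours) of this modified curve and of the corresponding modified value of G, assuming the ROC curve is available as a function roc mapping x to ROC(x) (for instance, a piecewise-linear interpolation of empirically computed ROC points):

```python
# Sketch (ours): G computed on ROC'(x) = max(ROC(x), x), i.e., falling back on the
# random-policy diagonal wherever the original curve dips below it.
def modified_gini(roc, grid_size=10_000):
    step = 1.0 / grid_size
    area = 0.0
    for i in range(grid_size):
        x = (i + 0.5) * step                     # midpoint rule
        area += (max(roc(x), x) - x) * step      # area between ROC'(x) and the diagonal
    return area / 0.5                            # normalized by the upper-left triangle area
```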

However, even this modified version of G may not be satisfactory for practitioners’ and researchers’ goals, which may require that one or more performance metrics of a defect prediction model be higher than some specified minimum reference values, and not simply better than a random classifier. The approach proposed in this paper (see Section 6) extends the idea of comparing the performance of models against random policies by taking into account specific performance metrics, whose values are computed for the defect prediction models obtained from a defect proneness model with all possible thresholds.

5.2 Comparison of Defect Proneness Models with AUC: Deceiving Cases

Suppose we have two defect proneness models \(fp_{1}(\underline {z})\) and \(fp_{2}(\underline {z})\). Figure 1a shows that the ROC curve of \(fp_{2}(\underline {z})\) is always above the ROC curve of \(fp_{1}(\underline {z})\), i.e., the Recall and Fall-out of \(fp_{2}(\underline {z})\) are never worse than those of \(fp_{1}(\underline {z})\). Accordingly, the AUC of \(fp_{2}(\underline {z})\) is greater than the AUC of \(fp_{1}(\underline {z})\). \(fp_{2}(\underline {z})\) is at least as good as \(fp_{1}(\underline {z})\) for all choices of t, but we do not need AUC to decide which defect proneness model is better.

Figure 1b instead shows two intersecting ROC curves. It is not straightforward to decide which defect proneness model is better by simply looking at the ROC curves, since neither curve dominates the other. Using AUC would have us conclude that \(fp_{1}(\underline {z})\) is not worse than \(fp_{3}(\underline {z})\), since the AUC of \(fp_{1}(\underline {z})\)is greater than the AUC of \(fp_{3}(\underline {z})\). However, Fig. 1b also shows that the AUC of \(fp_{1}(\underline {z})\) is greater than that of \(fp_{3}(\underline {z})\) mainly because the ROC curve of \(fp_{1}(\underline {z})\) is above the ROC curve of \(fp_{3}(\underline {z})\) when FP/AN> 0.6, i.e., when Fall-out is quite high and the defect prediction models obtained with both defect proneness models provide quite bad estimates.

Thus, just because the AUC of a defect proneness model is greater than that of another does not automatically mean that the former defect proneness model is preferable. Instead, we should restrict the comparison to the zone (i.e., the threshold range) where the defect prediction models obtained behave “acceptably”.

In the following sections, we propose methods to “purge” AUC from the noise originated by defect prediction models that do not perform well enough. The resulting indications are expected to be more reliable, hence more useful in practice and also more correct from a theoretical point of view.

6 Evaluation Based on Relevant Areas

Suppose we select a performance metric PFM and a random or deterministic method MTD for selecting a reference value to evaluate the acceptability of a defect prediction model. Let us denote by PFMMTD the reference performance value of MTD when evaluated via PFM. For instance, if we take FM as PFM and pop as MTD, we have FM\(_{pop} = \frac {1}{1+k}\), as shown in Table 2.

Among the defect prediction models that can be generated with fn\((\underline {z}\),t), the ones that should be considered performing sufficiently well are those that provide a better value of PFM than the reference value, e.g., better than what can be expected of a random policy. These are the practically useful defect prediction models, hence the ones that should be taken into account when evaluating the overall performance of fp\((\underline {z})\). For instance, if we decide to assess the performance of fn\((\underline {z}\),t) based on PFM1 = Recall and MTD1 = pop, we should only take into account those values of t in which fn\((\underline {z}\),t) has Recall> Recall\(_{pop} = \frac {1}{1+k}\).

Note that there are always values of t so small or so large that the estimates’ performance according to some PFM is similar to, or even worse than, the performance of a reference policy. Hence, it does not make sense to evaluate fn\((\underline {z}\),t) for all the values of t. Instead, we take into account the points (x,y) of a ROC curve that satisfy inequality y > Recall\(_{pop} = \frac {1}{1+k}\) (see Table 2), i.e., the points above the horizontal straight line \(y = \frac {1}{1+k}\).

Recall captures one aspect of performance, mainly based on true positives, but other aspects can be of interest. Suppose we decide to assess the performance of fn\((\underline {z}\),t) based on PFM2 = Fall-out, which captures performance by taking into account the false positives, and MTD2 = pop. We should only take into account the values of t in which fn\((\underline {z}\),t) has Fall-out< Fall-outpop = 1 −Specificity\(_{pop} = \frac {1}{1+k}\). We are thus interested only in the points satisfying inequality \(x < \frac {1}{1+k}\), i.e., left of the vertical straight line \(x = \frac {1}{1+k}\).

If we are interested in the points of the ROC curve that are better than pop for both Recall and Fall-out, then both inequalities must be satisfied, and the evaluation must consider only the points of the ROC curve in ULR\((\frac {1}{1+k},\frac {1}{1+k})\), i.e., the highlighted rectangle in Fig. 2.

It is up to the practitioners and researchers to decide which metrics are of interest for their goals. For instance, they can use FM as PFM1 and NM as PFM2 and pop as both MTD1 and MTD2, to consider defect prediction models that perform better than pop for both FM and NM. The points of the ROC curve to take into account are represented in Fig. 3a, above and to the left of the two oblique straight lines with equations \(y = \frac {k}{2k+1}x + \frac {1}{2k+1}\) for FM and y = (k + 2)x − 1 for NM, as we show in Table 5.
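The RoIs discussed above (the ULR rectangle for Recall and Fall-out, and the FM/NM region) reduce to simple membership tests on ROC points; the following sketch (ours) encodes their pop borders, with k = AN/AP.

```python
# Sketch (ours): membership tests for two RoIs under the pop reference policy.
# A ROC point is (x, y) = (Fall-out, Recall); k = AN/AP.
def in_roi_recall_fallout(x, y, k):
    """Better than pop for both Recall and Fall-out: the rectangle ULR(1/(1+k), 1/(1+k))."""
    return y > 1 / (1 + k) and x < 1 / (1 + k)

def in_roi_fm_nm(x, y, k):
    """Better than pop for both FM and NM (borders of Table 5, shown in Fig. 3a)."""
    return y > k / (2 * k + 1) * x + 1 / (2 * k + 1) and y > (k + 2) * x - 1
```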

Fig. 3 A ROC curve and the RoI in which \(FM \ge FM_{pop} \wedge NM \ge NM_{pop}\) (a), and a subset of the ROC space that is not a RoI (b)

Table 5 Borders for performance metrics for random policies

More generally, other metrics and reference policies may be defined and used well beyond the ones illustrated in this paper. Different choices of metrics and reference policies may lead to delimiting any subset of the ROC space. Clearly, if one is interested in using several metrics and several corresponding reference policies, the subset of the ROC space is the intersection of the single subsets, each of which is built by means of a metric and a reference policy.

However, not all ROC space subsets are useful or sensible. We introduce the notion of “Region of Interest,” to define which ones should be used.

Definition 3

Region of Interest (RoI). A subset of the ROC space is said to be a Region of Interest (RoI) if and only if it contains the upper-left rectangles of all of its points, i.e.,

$$ \begin{array}{@{}rcl@{}} \forall\ x,y\colon (x,y) \in RoI \Rightarrow ULR(x,y) \subseteq RoI \end{array} $$
(1)

The border of a RoI is the part of the boundary of the RoI in the interior of the ROC space, i.e., it is the boundary of the RoI without its parts that also belong to the ROC space perimeter. The border of the RoI is the part of its boundary that really provides information on how the RoI is delimited, since the perimeter of the ROC space can be taken for granted as a delimitation.

The union of the light blue and grey regions in Fig. 3a is an example of a RoI, in which \(FM \ge FM_{pop} \wedge NM \ge NM_{pop}\) (see Section 7.1). An example of a subset of the ROC space that is not a RoI is in Fig. 3b.

RoIs have a few properties, which we prove in Appendix A.

  • The intersection of any number of RoIs is a RoI.

  • The intersection of any number of RoIs is nonempty.

  • A RoI is connected.

  • The smallest RoI to which a point (x,y) belongs is ULR(x,y).

  • A RoI always contains point (0,1).

  • The border of a RoI is a (not necessarily strictly) increasing function. Thus, graphically, a RoI is above and to the left of its border.

With a reference value derived from a random policy, the points on the border of a RoI correspond to unacceptable defect prediction models, as their performance is as good as what can be expected of a random policy. If, instead, the reference value is selected deterministically, all points on the boundary correspond to acceptable classifiers, e.g., one takes ϕ = 0.4 as minimum ϕ value for an acceptable defect prediction model.

In what follows, we implicitly assume that the points on the border of a RoI are included or not in the RoI depending on whether a reference value has been selected via a random policy or deterministically.

6.1 The Ratio of Relevant Areas

We propose to assess a defect proneness model fp\((\underline {z})\) via the Ratio of the Relevant Areas (RRA), which takes into account only the RoI selected by a practitioner or researcher.

Definition 4

Ratio of the Relevant Areas (RRA). The Ratio of the Relevant Areas of a ROC curve ROC(x) in a RoI is the ratio of the area of the RoI that is below ROC(x) to the total area of the RoI.

In Fig. 3a, the RoI is the union of the light blue and grey regions, in which the light blue region is the part of the RoI below ROC(x). RRA is the ratio of the area of the light blue region to the area of the RoI.
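Concretely, RRA can be approximated numerically on a grid over the ROC space; the following sketch (ours) takes the ROC curve as a function of x and any RoI membership predicate (such as in_roi_fm_nm sketched above), and is not necessarily the computation used in Section 10.

```python
# Sketch (ours): grid approximation of RRA = area of the RoI below ROC(x) / area of the RoI.
def rra(roc, in_roi, grid_size=1_000):
    below = total = 0
    for i in range(grid_size):
        for j in range(grid_size):
            x, y = (i + 0.5) / grid_size, (j + 0.5) / grid_size
            if in_roi(x, y):
                total += 1
                if y <= roc(x):
                    below += 1
    return below / total if total else 0.0

# e.g., RRA(FM, NM) for a dataset with k = 27/16:
# rra(roc, lambda x, y: in_roi_fm_nm(x, y, k=27/16))
```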

AUC and G are special cases of RRA, obtained, respectively, when the RoI is the whole ROC space and the upper-left triangle. From a conceptual point of view, it is sounder to consider the area under the portion of the ROC curve in the RoI than to consider the areas under the entire ROC curve taken into account by AUC and G: RoI represents the part of the ROC space in which defect prediction models perform sufficiently well to be used.

Take, for instance, the case in which we use reference random policies of interest along with a set of performance metrics of interest to build a RoI. By considering the parts of ROC(x) outside the RoI, one would also take into account values of t that make \(fn(\underline {z}, t)\) worse than a random estimation method. When we know that a given defect prediction model is worse than a random policy for a set of performance metrics, it is hardly interesting to know precisely how well it performs. However, this is what AUC and G do.

Note that Definition 4 is quite general, as it allows the use of different reference policies for different performance metrics even when they are used together. For instance, one may be interested in the points of a ROC curve that are better than FMpop and, at the same time, better than the NM value obtained with uni with p = 0.7. In what follows, however, we assume that the same reference policy is used for all of the performance metrics selected.

7 RoIs for Specific Performance Metrics and Reference Values

The requirement that a defect prediction model satisfy a minimum acceptable level c for a performance metric PFM corresponds to a RoI in the ROC space. We here show the equations of the borders of the RoIs corresponding to the metrics in Table 2. Appendix B shows how the equations for these borders were obtained, by explaining how these metrics, defined in terms of the cells of the confusion matrix, can be expressed in terms of x and y. These borders are akin to the “iso-performance lines” or “isometrics” proposed in Flach (2003), Provost and Fawcett (2001), and Vilalta and Oblinger (2000).

7.1 RoIs for Performance Metrics with Respect to the Positive and Negative Classes

For completeness, Table 5 summarizes the borders of the RoIs for the performance metrics of Table 2 with respect to the positive and negative classes, under both the uni and pop reference policies. At any rate, we only use pop in the examples and in the empirical study of Section 10.

  • Column “Formula” provides the definition of the performance metric in each row in terms of x and y. For instance, Precision \(= \frac {y}{kx+y}\).

  • Column “Border” shows the equation of the general straight line that represents the border of the RoI obtained when the performance metric corresponding to the row is given a constant value c. For instance, line \(y = \frac {c}{1-c}kx\) includes the points where Precision=c. For practical usage, when we select a specific metric PFM and a reference method MTD, we replace the generic parameter c by the specific PFMMTD chosen.

  • Column “uni” shows the equation of the border when c is replaced by PFMuni with probability p, where PFM is the metric in the corresponding row. This is the border of the RoI where defect prediction models have greater value of PFM than expected of the uni policy with probability p.

  • Likewise, column “pop” shows the equation of the border when c is replaced by PFMpop, where PFM is the metric in the corresponding row.

It can be shown that each equation in the “uni” column in Table 5 describes a pencil of straight lines (Cremona 2005) through point (p,p), i.e., \((\frac {1}{1+k},\frac {1}{1+k})\) with pop.

Figure 4 shows the ROC curve already shown in Fig. 2, along with all the lines corresponding to the borders mentioned in Table 5 for pop. The portion of the ROC curve above and to the left of the borders of all performance metrics (ULR\((\frac {1}{1+k},\frac {1}{1+k})\), in this case) is quite small compared to the entire ROC curve. Thus, there is a relatively small range of t values for which the \(fn(\underline {z}, t)\) defect prediction models perform better than pop according to multiple metrics.

Fig. 4 ROC curve of fp(WMC) for the berek dataset with multiple pop constraints

The borders in Table 5 follow expected change patterns when k, c, and p change. Higher values of c are associated with stricter constraints, e.g., the slope of the Precision straight line \(y = \frac {c}{1-c}kx\) increases with c. Appendix C details how these borders behave for each metric when k, c, and p change.

We use the pop policy in the empirical validation of Section 10. For notational convenience, we denote by RoI(PFM1, PFM2, … ) the RoI defined by constraint PFM1 > PFM1,popPFM2 > PFM2,pop ∧…), and by RRA(PFM1, PFM2, … ) the value of RRA for RoI(PFM1, PFM2, … ). For instance, RRA(FM, NM) denotes the value of RRA for RoI(FM, NM), i.e., the RoI with FM> FMpopNM> NMpop, e.g., the union of the light blue and grey regions in Fig. 3a.

7.2 RoIs for Overall Metrics

Table 6 shows the formula and the border obtained for each of the three overall performance metrics PFM in Table 2 when one sets a minimum acceptable value c for it, i.e., one requires PFM ≥ c. Unlike Table 5, Table 6 does not contain the “uni” and “pop” columns, because we showed in Table 2 that J, Markedness, and ϕ are all equal to 0 under random policies. Therefore, a RoI is defined by means of a deterministically chosen value of c.

Table 6 Borders for overall performance metrics

The lines for constant Youden’s J are straight lines parallel to the diagonal. As for Markedness, it can be shown that the constant lines are parabolas, with symmetry axis \(y = -kx + \frac {1+k}{2}+\frac {k(k-1)}{2c(1+k^{2})}\). The details are in Appendix D.

ϕ has received the most attention among these three metrics in the past. It can be shown that the border for ϕ = c is an ellipse that goes through points (0,0) and (1,1) for all values of c and k and intercepts the perimeter of the ROC space in four points (except for very special values of c and k). Appendix D shows the details of the analytic results we obtained.
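As a numerical alternative to that closed form, note that substituting TP = y AP, FP = x AN, FN = (1 − y)AP, and TN = (1 − x)AN into the definition of ϕ yields \(\phi (x,y) = \sqrt {k}(y-x)/\sqrt {(y+kx)(1-y+k(1-x))}\); the following sketch (ours) uses this expression and bisection to trace the ϕ = c border, relying on ϕ being non-decreasing in y for fixed x (any point of ULR(x,y) is not worse than (x,y)).

```python
# Sketch (ours): tracing the phi = c border numerically.
import math

def phi(x, y, k):
    if (x, y) in ((0.0, 0.0), (1.0, 1.0)):   # phi is 0/0 at these corners; treat as 0 here
        return 0.0
    return math.sqrt(k) * (y - x) / math.sqrt((y + k * x) * (1 - y + k * (1 - x)))

def phi_border_y(x, k, c, tol=1e-9):
    """Smallest y with phi(x, y) >= c; None where even y = 1 does not reach c."""
    if phi(x, 1.0, k) < c:
        return None
    lo, hi = x, 1.0                          # phi(x, x, k) = 0 < c, phi(x, 1, k) >= c
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if phi(x, mid, k) < c else (lo, mid)
    return hi

# e.g., the phi = 0.4 border for the berek dataset (k = 27/16):
# [(x / 10, phi_border_y(x / 10, 27 / 16, 0.4)) for x in range(10)]
```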

Figure 5a shows the ellipse for the berek dataset (from the SEACRAFT repository) with c = 0.4, a value that represents medium/strong association between a defect prediction model and actual faultiness (Cohen 1988). Note that there are two unconnected parts of the ROC space in which ϕc, delimited by the dashed part of the ellipse and the solid part of the ellipse. Based on Definition 3, only the upper-left part above the dotted arc of the ellipse is a legitimate RoI.

Fig. 5 ROC curve of the defect proneness model fp(WMC) for the berek dataset with constraints on ϕ

Figure 5b shows the borders of the RoIs associated with ϕ= 0.4 (the lowest line), 0.6, and 0.8 (the highest line). By comparing Figs. 5 and 4, it is easy to see that the points of the ROC curve that satisfy the constraints mentioned in Table 5 also satisfy constraint ϕ ≥ 0.4. However, only a few points of the ROC curve (corresponding to a few selected values of t) satisfy constraint ϕ ≥ 0.6. No point of the ROC curve satisfies constraint ϕ ≥ 0.8.

8 Taking Cost into Account

We have so far considered the evaluation of defect prediction models with respect to the performance of estimates. Though the notion of performance is important, practitioners are usually also interested in other characteristics of estimates, such as the cost of misclassifying a faulty module as not faulty, or vice versa. As we show in Section 8.1, there is a clear relationship between the choice of a performance metric and the cost of misclassification.

We first show how to derive the border of a RoI based on the misclassification cost. Like most of the literature, we assume that each false negative (resp., positive) has the same cost cFN (resp., cFP), so the total cost TC is (Hand 2009)

$$ \begin{array}{@{}rcl@{}} TC = c_{FN} FN+ c_{FP} FP \end{array} $$
(2)

TC can be computed in terms of x and y of the ROC space as follows

$$ \begin{array}{@{}rcl@{}} TC = c_{FN} AP (1-y)+ c_{FP} AN \cdot x = n \left( c_{FN} \frac{AP}{n} (1-y)+ c_{FP} \frac{AN}{n} x \right) \end{array} $$
(3)

By setting \(\lambda = \frac {c_{FN}}{c_{FN}+c_{FP}}\) and dividing TC by n(cFN + cFP) (which is independent of the defect prediction model used), we can focus on Normalized Cost \(NC = \frac {TC}{n(c_{FN}+c_{FP})}\) (Khoshgoftaar and Allen 1998)

$$ \begin{array}{@{}rcl@{}} NC = \lambda \frac{AP}{n} \left( 1-y \right)+ \left( 1-\lambda \right) \frac{AN}{n} x = \lambda \frac{1}{1+k} \left( 1-y \right)+ \left( 1-\lambda \right) \frac{k}{1+k} x \end{array} $$
(4)

NC is related to Unitary Cost \(UC = \frac {TC}{n}=(c_{FN}+c_{FP})NC\), so constraints on UC get immediately translated into constraints on NC and vice versa.

Usually, cFN is much greater than cFP, as false negatives have more serious consequences than false positives, financially and otherwise. Accordingly, λ is usually much closer to 1 than to 1/2 (value 1/2 corresponds to cFN = cFP).

8.1 Borders Based on Random Policies

For any random policy, thanks to basic properties of expected values, we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E}[NC] = \lambda \frac{AP}{n} \mathbb{E}[\left( 1-y \right)]+ \left( 1-\lambda \right) \frac{AN}{n} \mathbb{E}[x] \end{array} $$
(5)

and for the uni and pop policies we have, based on Table 3

$$ \begin{array}{@{}rcl@{}} \mathbb{E}[NC_{uni}] = \lambda \frac{AP}{n} - p \left( \lambda - \frac{AN}{n} \right) \end{array} $$
(6)
$$ \begin{array}{@{}rcl@{}} \mathbb{E}[NC_{pop}] = \frac{AP \cdot AN}{n^{2}} = \frac{k}{(1+k)^{2}} \end{array} $$
(7)

Thus, one should use only those defect prediction models whose NC is less than the expected normalized cost of a random policy, i.e., \(NC < \frac {AP \cdot AN}{n^{2}}\).

Since \(\mathbb {E}[NC_{pop}]\) is independent of the specific cost per false negative or positive, Formula (7) provides a general result that applies to all defect prediction models, regardless of the way they have been built, e.g., with or without using techniques based on defect proneness models and thresholds.

The value of \(\mathbb {E}[NC_{pop}] = \frac {AP \cdot AN}{n^{2}} = \frac {AP}{n}-\frac {AP^{2}}{n^{2}}\) depends on the intrinsic characteristics of the dataset. As a function of AP, it describes a parabola with minimum (\(\mathbb {E}[NC_{pop}] = 0\)) when AP = 0 or AP = n and maximum (\(\mathbb {E}[NC_{pop}] = \frac {1}{4}\)) when \(AP = \frac {n}{2}\). Thus, \(NC < \frac {AP \cdot AN}{n^{2}}\) is easier to satisfy when APAN, i.e., for balanced datasets, and more difficult when this is not the case. As we show in Section 8.2, lower values of NC call for better performing defect prediction models.

Via mathematical transformations, it can be shown that the borders of the RoIs that satisfy inequality \(NC < \frac {k}{(1+k)^{2}}\) for different values of λ are represented by the following pencil of straight lines, with center in \((\frac {1}{1+k},\frac {1}{1+k})\)

$$ \begin{array}{@{}rcl@{}} y = \frac{1-\lambda}{\lambda} kx + 1 - \frac{k}{\lambda(1+k)} \end{array} $$
(8)

It can be shown that the slope of the border decreases as λ varies from 0 to 1. Thus, the border rotates around center point \((\frac {1}{1+k},\frac {1}{1+k})\) in a clockwise fashion from vertical straight line \(x = \frac {1}{1+k}\) to horizontal straight line \(y = \frac {1}{1+k}\). A special case occurs when \(\lambda = \frac {k}{1+k}\), since the line is the diagonal y = x.

Recall that all of the straight lines for all of the performance metrics in Table 5 go through center point \((\frac {1}{1+k},\frac {1}{1+k})\) too. Thus, they are special cases of the straight lines described in Formula (8), for specific values of λ. Specifically, we have, in increasing order of the value of λ: for Fall-out, λ = 0; for NM, \(\lambda = \frac {k}{2(1+k)} = \frac {AN}{2n}\); for Precision and for NPV, \(\lambda = \frac {k}{1+k} = \frac {AN}{n}\); for FM, \(\lambda = \frac {2k+1}{2k+2} = \frac {1}{2}+\frac {AN}{2n}\); and for Recall, λ = 1.

Thus, the selection of any of these metrics is not simply an abstract choice on how to assess the performance of defect prediction models, but implies the choice of a specific cost model with a specific \(\lambda = \frac {c_{FN}}{c_{FN}+c_{FP}}\), which implies a specific ratio \(\frac {c_{FN}}{c_{FP}}\).

Based on observations of past projects’ faults and fault removal costs, one could estimate a likely value \(k_{e}\) for the ratio AN/AP and a likely range \([\lambda_{l}, \lambda_{u}]\) for λ, to evaluate a given \(fn(\underline {z}, t)\) classifier based on the RoI identified by

$$ \begin{array}{@{}rcl@{}} y \ge \frac{1-\lambda_{l}}{\lambda_{l}} k_{e} x + 1 - \frac{k_{e}}{\lambda_{l}(1+k_{e})} \wedge y \ge \frac{1-\lambda_{u}}{\lambda_{u}} k_{e} x + 1 - \frac{k_{e}}{\lambda_{u}(1+k_{e})} \end{array} $$
(9)
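A sketch (ours) of this membership test, evaluating the border of Formula (8) at both extremes of the λ range; λ is assumed to lie in (0, 1].

```python
# Sketch (ours): the cost-based RoI of Formula (9) for an estimated k_e and a likely
# range [lambda_l, lambda_u] for lambda.
def cost_border_y(x, k, lam):
    """Border of Formula (8); lam is assumed to be in (0, 1]."""
    return (1 - lam) / lam * k * x + 1 - k / (lam * (1 + k))

def in_cost_roi(x, y, k_e, lam_l, lam_u):
    return (y >= cost_border_y(x, k_e, lam_l) and
            y >= cost_border_y(x, k_e, lam_u))
```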

8.2 Cost-reduction RoIs

Practically useful RoIs should represent strict constraints for defect prediction models. For instance, take λ = 0.9, i.e., suppose that a false negative is 9 times as expensive as a false positive. The corresponding line described in Formula (8) for the ROC curve in Fig. 4 is the lowest one in Fig. 6, which does not appear to set a strict constraint and may be of little practical interest.

Fig. 6 ROC curve for fp(WMC) for the berek dataset with multiple constraints on normalized cost (λ = 0.9)

To obtain a more useful constraint, a software manager may set a maximum acceptable value for the unitary cost, which translates into a maximum value NCmax for NC. NCmax can be expressed as a fraction μ of \(\mathbb {E}[NC_{pop}]\), i.e., \(NC_{max} = \mu \frac {k}{(1+k)^{2}}\). Constraint NC < NCmax defines a RoI with border

$$ \begin{array}{@{}rcl@{}} y = \frac{1-\lambda}{\lambda} kx + 1 - \mu \frac{k}{\lambda(1+k)} \end{array} $$
(10)

The border depends on parameters k, λ, and μ. We describe in Appendix F the effect of having different values of k, which happens with different datasets. We here investigate the effect of selecting different values of μ, possibly in combination with different values of λ.

Formula (10) shows that μ only influences the intercept of the border. For given values of k and λ, the smaller μ (i.e., the smaller NCmax), the higher the border line. Figure 6 shows the borders for the berek dataset (which has k = 27/16) when λ = 0.9, for μ = 1, μ = 0.75, and μ = 0.5. Formula (8) is a special case of Formula (10) with μ = 1 and it can be easily shown that, for any given value of μ, Formula (10) describes a pencil of straight lines, one for each value of λ, with center in \((\mu \frac {1}{1+k}, 1-\mu \frac {k}{1+k})\). Thus, for any given value of μ, different values of λ have the same effect as we described in Section 8.1.

As μ varies, the center point \((\mu \frac {1}{1+k}, 1-\mu \frac {k}{1+k})\) moves on the straight line y = −kx + 1. It moves upwards and to the left as μ decreases, as expected, since the constraint becomes stricter. When μ tends to 0, then the center point tends to (0,1), which represents perfect classification.

We denote as RRA(λ = cλ, μ = cμ) the value of RRA obtained for the RoI whose border is identified by using λ = cλ and choosing μ = cμ.

The value of μ is related to the performance of defect prediction models quantified by any metric. For instance, take Precision and suppose that a technique for defect prediction models guarantees a minimum value of Precision=c. The border corresponding to Precision=c is \(y=\frac {c}{1-c}kx\) (see Table 5). This straight line intersects line y = −kx + 1 at point \((\frac {1-c}{k}, c)\), which corresponds to \(\mu = \frac {(1-c)(1+k)}{k}\). This is the cost reduction proportion that can be obtained with a technique that improves the value of Precision from Precisionpop to c.

Conversely, suppose we focus on Precision and we plan to achieve a μ cost reduction. The required improvement in Precision is \(c = 1 - \frac {k \mu }{1+k}\).
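These two conversions translate directly into code; a minimal sketch (ours), with an illustrative value of k:

```python
# Sketch (ours): correspondence between a target Precision value c and the
# cost-reduction proportion mu, for a given k = AN/AP.
def mu_from_precision(c, k):
    return (1 - c) * (1 + k) / k

def precision_from_mu(mu, k):
    return 1 - k * mu / (1 + k)

# e.g., for k = 4, Precision = 0.6 corresponds to mu = 0.5, i.e., halving E[NC_pop],
# and conversely precision_from_mu(0.5, 4) returns 0.6.
```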

Similar computations can be carried out for all other performance metrics. The results are in Appendix E.

9 Evaluating RRA

RRA is clearly related to AUC and Gini’s G, so one may wonder if RRA is better than AUC or G, and, if so, to what extent.

First, the main difference between RRA, on one side, and AUC and G, on the other side, is that RRA assesses a defect proneness model only based on the points of a ROC curve corresponding to defect prediction models that are worth evaluating, while AUC and G use all of the points of a ROC curve.

We here assess the extent to which our approach restricts the area of the portion of the ROC space that is taken into account as compared to AUC and G, by computing the areas of two types of RoIs that we have already used in the previous sections and that we also use in the empirical study of Section 10. The area of RoI(Recall, Fall-out) is equal to \(\frac {k}{(k+1)^{2}}\). As a function of k, this area has maximum value \(\frac {1}{4}\), attained for k = 1, i.e., AP = AN. Thus, the computation of RRA(Recall, Fall-out) takes into account at most only one-fourth of the portion of the ROC space taken into account by AUC (whose area is 1), and one-half of the portion taken into account by G (whose area is 0.5). The more k tends to 0 or infinity, the more the area of the RoI shrinks. For instance, for \(k = \frac {AN}{AP} = 4\), the area of the RoI is \(\frac {4}{25} = 0.16\), which corresponds to 16% of the area considered for AUC and 32% of the area considered by G.

The same phenomenon occurs for the area of RoI(FM, NM), which is equal to \(\frac {3k}{(k+2)(2k+1)}\), with maximum value \(\frac {1}{3}\) when k = 1. For k = 4, the value is \(\frac {2}{9} \simeq 0.22\), i.e., 22% of the area considered for AUC and 44% of the area considered by G.

Thus, these two RoIs take into account a rather reduced proportion of the ROC space if compared to AUC or even G. In practice, by considering irrelevant regions, AUC and G incorporate a large “noise” that is higher for projects with a relatively small (or large) defect density. Section 10 shows some results we obtained for these areas in our empirical study, along with results about the proportion of classifiers of ROC curves that fall into our RoIs.

Second, the relationship between the values of RRA and those of the performance metrics AUC and G is necessarily strong only in two cases, i.e., for models that are either exceptionally good or exceptionally bad. Suppose that a model is so good as to have AUC = 1: then the entire ROC space is under the ROC curve, and both G and RRA are therefore equal to 1. The converse is also true. If RRA = 1, then the ROC curve is above the entire RoI chosen. Since any RoI includes the perfect estimation point (0,1), the ROC curve goes through it, so AUC = 1 and G = 1. By continuity, exceptionally good models that achieve near-perfect estimation are very likely to have values of AUC, G, and RRA close to 1. The empirical study in Section 10 provides some evidence on the range of AUC values for which this relationship between RRA and the existing metrics AUC and G holds, even in approximate form.

When it comes to bad models, the implication is unidirectional, in general. Suppose that a model is so bad that its ROC curve coincides with the diagonal, so that AUC = 0.5 and G = 0: then RRA = 0 for all kinds of RRA. However, RRA = 0 does not imply AUC = 0.5 and G = 0. As a first example, take the ROC curve in Fig. 5b: we have RRA(ϕ = 0.8) = 0, while AUC > 0. As a second example, suppose that the ROC curve goes through point \((\frac {1}{1+k},\frac {1}{1+k})\) and that G > 0 (i.e., AUC > 0.5), which is the case unless the ROC curve entirely coincides with the diagonal: we have RRA(Recall, Fall-out) = 0, because the ROC curve never enters the region of interest. Thus, for bad models and, as we see in Section 10, for models that are not exceptionally good, RRA can provide a different perspective and evaluation than AUC and G.

Third, RRA is customizable, since it allows practitioners and researchers to define the set of points of a ROC curve they are interested in, i.e., those in a RoI built by selecting specific performance metrics and reference policies, while this is not possible with AUC and G.

Fourth, the assessment of whether RRA is better than AUC or G requires defining what “better” actually means in this case. RRA, AUC, and G all provide an overall evaluation of the performance of the defect prediction models fn\((\underline {z}\),t) for all values of t. As such, RRA, AUC, and G are aggregate functions of these models. Thus, to assess them, we would need another aggregate function that provides the “ideal” figure of merit against which we can compare the performance of RRA, AUC, and G. Unfortunately, such an “ideal” function does not exist or is not known. Total Cost TC would be a suitable “ideal” figure of merit, but it is unknown. That is why metrics like AUC and G were introduced and used in the first place: they would not have been introduced if TC were known. At any rate, RRA can take into account different cost models via parameter λ and different cost requirements via parameter μ in ways that AUC and G cannot.

Therefore, the empirical study we present in Section 10 does not and cannot have the goal of showing whether RRA is “better” than AUC and G, or any other metric, for that matter. Rather, we want to show the differences in the assessment of defect proneness models between RRA and the existing ones, and show that RRA can be more reliable and less misleading.

10 Empirical Study

We analyzed 67 datasets for a total of 87,185 modules from the SEACRAFT repository (https://zenodo.org/communities/seacraft). These data were collected by Jureczko and Madeyski (2010) from real-life projects of different types, and have been used in several defect prediction studies (e.g., Bowes et al. 2018; Zhang et al. 2017). The number of modules in the datasets ranges between 20 and 23,014 with an average of 1,300, a standard deviation of 3,934, and a median of 241.

For each module (a Java class, in this case), all datasets report data on the same 20 independent variables. In addition, the datasets provide the number of defects found in each class, which we used to label modules as actually negative and positive. The datasets are fairly different in terms of the AP/n ratio, which ranges from 2% to 98.8%. The histogram in Fig. 8 shows the frequency distribution of the proportion of defective modules in the datasets. Though small values prevail in the distributions of n, AP/n, and LOC (as is typical of software projects in general), fairly large values are well represented (e.g., half of the projects are larger than 59,000 LOC).

For each dataset, we built a BLR model and a Naive Bayes (NB) model using all available measures as independent variables.

Here are the Research Questions that we address in our empirical study.

  • RQ1 To evaluate defect proneness models in practice, to what extent are the regions of the ROC space used by RRA more adequate than the regions used by AUC and G?

    With RQ1, we investigate if, in real-life projects, it is possible to have substantial differences between RRA and the existing performance metrics AUC and G, as these performance metrics take into account different regions of the ROC space and therefore different defect prediction models.

  • RQ2 How frequently are there substantial differences between RRA and traditional performance metrics AUC and G in using more adequate regions of the ROC space?

    By answering RQ2, we check whether RRA is truly useful only in corner cases for extreme projects, or for a large share of the population of projects.

10.1 RQ1: Extent of Adequacy of ROC Space Regions Used

For each of the 67 datasets, the BLR model we obtained had a higher AUC value than the NB model, with only one exception, for the xalan 2.7 project. In what follows, we present the results for the 67 BLR models and also discuss the results for the NB model for the xalan 2.7 project.

To answer RQ1, we consider the interpretation of AUC proposed by Hosmer et al. (2013) and illustrated in Table 4. Accordingly, we split the BLR models we obtained into three classes: acceptable, excellent, and outstanding. Note that we obtained only one BLR model with AUC < 0.7, specifically, with AUC = 0.69. As this value is very close to the 0.7 lower boundary of the acceptable AUC category, we include it here in the acceptable AUC category of models, instead of analyzing it by itself in a separate category. Then, we computed the values of RRA(Recall, Fall-out) and RRA(ϕ = 0.4): Table 7 provides a summary of RRA values for each AUC category.

Table 7 Values of AUC and RRA for AUC ranges for BLR models

The NB model obtained for xalan 2.7 has AUC = 0.95, RRA(Recall, Fall-out) = 0, and RRA(ϕ = 0.4) = 0.01, while the corresponding BLR model has AUC = 0.69, RRA(Recall, Fall-out) = 0.03, and RRA(ϕ = 0.4) = 0. Note that the values of RRA are very low for both models, even though the two models greatly differ in the values of AUC. Moreover, RRA shows that the apparently outstanding NB model is actually not acceptably accurate.

Let us examine the results of Table 7 for the three categories of AUC values. First, models in the acceptable AUC category should be rejected: they all have RRA(ϕ = 0.4) = 0, and quite low RRA(Recall, Fall-out). Second, models classified as excellent according to AUC have a quite large variability of RRA. Third, models classified as outstanding according to AUC generally have very high values of RRA, although exceptions are possible: the model for jedit 4.3 has AUC = 0.9, but also RRA(Recall, Fall-out) = 0.35 and RRA(ϕ = 0.4) = 0.02 (noticeably, jedit 4.3’s dataset is characterized by AP/n = 0.02), and the NB model for xalan 2.7 has even more extreme values.

To better understand the relationship between AUC and RRA, we split the excellent AUC range into two sub-ranges, and also split the resulting model sets according to AP/n. Specifically, we split the model set with excellent AUC into those having AUC ∈ (0.8,0.87] and those having AUC ∈ (0.87,0.9]; threshold 0.87 was chosen to have enough models in each subset. As for the 25 models with outstanding AUC, 16 were perfect prediction classifiers, with AUC= 1. We put them in a separate category from those with AUC∈ (0.9,1).

The results are in Table 8, where “mid” indicates values of AP/n between 0.2 and 0.8, i.e., in the middle part of the [0,1] interval, and “l/h” indicates values of AP/n in the low or high range, i.e., less than 0.2 or greater than 0.8. By “any,” we indicate that we did not split the models based on AP/n.

Table 8 Values of AUC and RRA for BLR models and AP/n ranges

Table 8 shows that BLR models with AUC in the (0.8,0.87] range and “l/h” values of AP/n mostly have low values of RRA: e.g., the median RRA(ϕ = 0.4) is just 0.07 for these models. Models with AUC in the (0.87,0.9] range and mid values of AP/n instead have higher values of RRA. The other models are characterized by variable RRA, hence they should be evaluated individually.

As for models with outstanding AUC values, Table 8 obviously confirms that, as noted in Section 9, models with AUC = 1 also have RRA = 1. The models with AUC ∈ (0.9,1) generally have high RRA(Recall, Fall-out) and RRA(ϕ = 0.4), even though their median RRA(ϕ = 0.4) = 0.55 is only a bit over half the RRA(ϕ = 0.4) value of the models with AUC = 1. There are two important exceptions, as noted above: the BLR model for jedit 4.3, with outstanding AUC = 0.9 and extremely poor RRA(ϕ = 0.4) = 0.02, and the NB model for xalan 2.7, with outstanding AUC = 0.95 and extremely poor RRA(ϕ = 0.4) = 0.01.

To provide additional evidence, Fig. 7 shows the values of AUC and RRA(ϕ= 0.4) for the models obtained from datasets with AP/n ≤ 0.2 or AP/n ≥ 0.8.

Fig. 7 AUC (red continuous line) and RRA(ϕ = 0.4) (blue dashed line) for projects with AP/n ≤ 0.2 or AP/n ≥ 0.8

Together, Table 8 and Fig. 7 indicate that the strong relationship between the values of AUC and RRA we described in Section 9 only holds when AUC is extremely close to 1, but no longer exists for values of AUC that are considered excellent or even outstanding.

Thus, in response to RQ1, we can observe that RRA appears to take into account a more useful region of the ROC space than AUC and G in the evaluation of a defect prediction model. As a consequence, RRA provides more realistic evaluations than traditional performance metrics, which can be unreliable under some conditions.

10.2 RQ2: Frequency of Using More Adequate Regions

To answer RQ2, we need to check whether low or high values of defect density AP/n are frequent or rare. Figure 8 shows the distribution of defect density values (rounded to the first decimal) of the projects we considered. Quite a large share (31 projects out of 67) have AP/n ≤ 0.2, i.e., a fairly small defect density, while 2 out of 67 have AP/n ≥ 0.8, i.e., a very high defect density.

Fig. 8 Distribution of defect density in the datasets by Jureczko and Madeyski

Thus, at least for the considered datasets, for about one half of the models, traditional indicators are bound to provide evaluations based on very large portions of the ROC curves where predictions are worse than random.

Also, notice that the range of “l/h” values of AP/n covers 40% of the entire [0,1] range of AP/n, but accounts for 49% of the datasets: projects are more concentrated in these subintervals of AP/n, for which larger portions of the area measured by AUC are meaningless.

To obtain additional quantitative insights, we performed some further analysis. We computed the areas of RoI(Recall, Fall-out) for the analyzed datasets: the mean area is 0.164, the median is 0.173, and the standard deviation is 0.067. 49% of the datasets have a RoI(Recall, Fall-out) whose area is ≤ 0.16; since the whole ROC space has area 1 and the area above the diagonal has area 0.5, this means that more than 84% of the ROC space used to compute AUC and 68% of the area above the diagonal considered to compute G are in the region representing classifiers that are random or worse than random.

Similarly, we computed the areas of RoI(FM, NM): the mean area is 0.22, the median is 0.24, and the standard deviation is 0.09. 49% of the datasets have an area of RoI(FM, NM) no greater than \(0.24\overline{2}\), i.e., more than 75% of the ROC space used to compute AUC and 52% of the area above the diagonal considered to compute G are in the region representing classifiers that are random or worse than random (\(0.24\overline{2}\) is the area of RoI(FM, NM) when AP/n is 0.2 or 0.8).

In conclusion, for a large share of datasets, evaluations based on AUC or G are largely based on regions of the ROC space that should not be considered.

However, one may suppose that the classifiers represented by a ROC curve are mostly concentrated in RoIs such as RoI(Recall, Fall-out). If this were the case, then some of the issues related to using defect prediction models that perform worse than random policies would be alleviated. Thus, we investigated how many points of the ROC curves we obtained lie outside RoI(Recall, Fall-out): all the models have over 50% of their classifiers outside the RoI, and 64% of the models have over 2/3 of their classifiers outside the RoI.
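As an illustration of this check, the following Python sketch computes the share of a model’s ROC points that fall outside a region of interest. The membership predicate shown is only a placeholder: the actual borders of RoI(Recall, Fall-out) are the ones derived in the earlier sections of the paper.

```python
# Sketch of the check reported above: given the ROC points (Fall-out x, Recall y) of a
# model's classifiers, count the share that falls outside a region of interest.
from typing import Callable, List, Tuple

def share_outside_roi(points: List[Tuple[float, float]],
                      in_roi: Callable[[float, float], bool]) -> float:
    """Fraction of ROC points (x, y) that lie outside the region of interest."""
    outside = sum(1 for x, y in points if not in_roi(x, y))
    return outside / len(points)

# Illustrative placeholder only: a classifier is kept if it beats the diagonal
# (random behavior) by some margin; NOT the paper's actual RoI definition.
def toy_in_roi(x: float, y: float, margin: float = 0.1) -> bool:
    return y >= x + margin

roc_points = [(0.05, 0.20), (0.10, 0.45), (0.30, 0.35), (0.60, 0.80)]
print(share_outside_roi(roc_points, toy_in_roi))  # 0.25 for these toy points
```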

Finally, Fig. 9 shows boxplots comparing the distributions of AUC, G, RRA(Recall, Fall-out), and RRA(ϕ = 0.4). Figure 9a concerns all the models: AUC provides quite optimistic evaluations, since, except for just one case, AUC is greater than 0.7, with mean and median well above 0.8. On the contrary, the values of RRA(Recall, Fall-out) are more widely spread, indicating that RRA(Recall, Fall-out) discriminates models more severely and realistically. Values of RRA(ϕ = 0.4) are even more widely spread, with lower mean and median. Figure 9b provides the same comparison, considering only the models of the 33 out of 67 datasets having “l/h” AP/n. It is noticeable that the distribution of AUC changes only marginally with respect to Fig. 9a; on the contrary, the distributions of RRA(Recall, Fall-out) and RRA(ϕ = 0.4) center on lower values, showing that the indications given by AUC are far too optimistic.

Fig. 9 Comparison of the distributions of AUC, G, RRA(Recall, Fall-out) and RRA(ϕ = 0.4)

Based on the collected evidence, we can answer RQ2 by stating that RRA indicators are useful quite frequently. On the contrary, it appears that AUC indications are seldom reliable.

11 The Software Engineering Perspective

The meaning of RRA is the same as the meaning of AUC and G, at a high level: all these indicators provide an evaluation of the performance of a defect proneness model. Accordingly, RRA can be applied just like AUC and G. However, as discussed in Section 7, RRA can be adapted to specific needs and goals. For instance, performance evaluation can be based on ϕ or on FM and NM. Accordingly, the meaning of RRA is more specific than the meaning of AUC or G. We now outline how our proposal can be used during software development.

11.1 Defect Prediction Model Selection

Suppose that the software manager of a software project, e.g., ivy 2.0, needs to use a defect prediction model based on several measures. The considered model appears reasonably good, since it has AUC= 0.87, RRA(FM,NM)= 0.42, RRA(Recall, Fall-out)= 0.39, RRA(λ = 0.9,μ = 0.5)= 0.22, RRA(ϕ = 0.4)= 0.10.

Figure 10 shows the corresponding ROC curve and RoI(Recall, Fall-out). When it comes to choosing a defect proneness threshold to build a defect prediction model, the software manager realizes that

  1. by selecting the models corresponding to points close to (0,1), i.e., those on the dotted curve in Fig. 10, as is often suggested, one obtains a model with Fall-out worse than that of random estimation;

  2. by selecting the model corresponding to the tangent point on the highest isocost line that touches the curve, i.e., the one touched by the dashed line in Fig. 10 (as suggested in Powers (2011), for instance), one again obtains a model whose Fall-out is worse than that of random estimation.

Fig. 10 ROC curve and RoI(Recall, Fall-out) of the LOC-based model for ivy 2.0

Thus, conventional wisdom concerning the position of the best fault prediction model in the ROC space is not always reliable. On the contrary, the RoI highlighted in Fig. 10 suggests where useful thresholds should be chosen from.
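The following Python sketch contrasts the two conventional heuristics listed above with a RoI-restricted selection. The isocost slope used is the standard iso-performance slope from ROC analysis, (cFP · AN)/(cFN · AP), and the RoI membership test is again a placeholder to be replaced with the borders derived in this paper.

```python
# Sketch (under the stated assumptions) of threshold selection on a ROC curve:
# points are (Fall-out x, Recall y, threshold t) triples produced by a defect
# proneness model; in_roi() is a placeholder for the paper's RoI borders.
import math
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, t)

def closest_to_ideal(points: List[Point]) -> Point:
    """Heuristic 1: the point at minimum Euclidean distance from the ideal point (0, 1)."""
    return min(points, key=lambda p: math.hypot(p[0], 1.0 - p[1]))

def isocost_tangent(points: List[Point], c_fp: float, c_fn: float,
                    ap: int, an: int) -> Point:
    """Heuristic 2: the point touched by the highest isocost line, i.e., maximizing y - s*x."""
    s = (c_fp * an) / (c_fn * ap)      # standard iso-performance slope
    return max(points, key=lambda p: p[1] - s * p[0])

def isocost_tangent_in_roi(points: List[Point], in_roi: Callable[[float, float], bool],
                           c_fp: float, c_fn: float, ap: int, an: int) -> Optional[Point]:
    """Same choice as Heuristic 2, but restricted to classifiers inside the RoI."""
    candidates = [p for p in points if in_roi(p[0], p[1])]
    return isocost_tangent(candidates, c_fp, c_fn, ap, an) if candidates else None
```

In the ivy 2.0 example, both unrestricted heuristics pick points outside RoI(Recall, Fall-out), whereas the restricted variant cannot, by construction: it simply reports that no threshold is acceptable when the RoI contains no classifier.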

11.2 Defect Proneness Model Selection

Suppose that a software manager has two defect proneness models (for instance, built with different modeling techniques) and needs to decide which one to use. AUC provides a spurious assessment of the performance of the defect proneness models, because it is also based on parts of the ROC space that are of no interest to the software manager. As Fig. 1b shows, two ROC curves ROC1(x) and ROC2(x) can intersect each other in such a way that the final selection is based mostly on parts of the ROC space that should not be considered.

However, by focusing only on the RoI, the software manager may find that ROC1(x) is predominantly (or even always) above ROC2(x) in the RoI, so the defect proneness model corresponding to ROC1(x) should be preferred to the one corresponding to ROC2(x) when building a defect prediction model. Thus, the software manager can make a better informed (and even easier) decision as to which defect proneness model to use.

11.3 Assessing Costs and Benefits of Additional Measures

Suppose a software manager is in charge of a software project in which no systematic code measurement process is in place, and only data on modules’ size (expressed in LOC) and defectiveness are currently available. The software manager builds a LOC-based defect proneness model fp(LOC) and uses AUC to decide whether the model’s performance is good enough. For instance, suppose this is the case for the xerces 1.4 project. The value of AUC for fp(LOC) is 0.75, which is right in the middle of the acceptable range. However, if the project manager also checks performance with the RRA metrics, the values RRA(Recall, Fall-out) = 0.2 and RRA(ϕ = 0.4) = 0.0006 show that fp(LOC) actually has much poorer performance than AUC would indicate.

Thus, the project manager may decide that more measures are needed to build better performing models, start a systematic code measurement collection process, and finally obtain an fp model based on multiple code measures. Suppose that the total set of measures obtained after implementing the new measurement process is the one in the datasets by Jureczko and Madeyski. The BLR model that uses all of them has very high RRA(Recall, Fall-out) = 0.77 and RRA(ϕ = 0.4) = 0.73. Thus, this model has much better performance than the LOC-based model and therefore higher trustworthiness.

However, there may be costs associated with establishing a systematic measurement program, which need to be weighed against the benefits of having better performing models. For instance, if additional measures can be obtained by building or buying an automated tool that analyzes software code, the software manager incurs a one-time cost. If, instead, collecting the measures requires the use of resources, then there is a cost associated with every execution of the measurement program. As an example, suppose that a well-performing model requires knowledge of the Function Points associated with the software system. Counting Function Points can be quite expensive (Jones 2008; Total Metrics 2007). Thus, by using RRA-based metrics, the software manager can better assess the costs and benefits of an expanded measurement program.

11.4 Adaptability to Goals

Unlike with AUC and G, software managers can “customize” RRA for their projects and goals. The value of λ to be used in RRA(λ,μ) is derived from the unit costs cFN and cFP. Thus, RRA(λ,μ) depends on the characteristics of a project, as different projects have different values of cFN and cFP and therefore of λ. So, RRA(λ,μ) takes into account costs more precisely than AUC and G can for a specific project.

Parameter μ is related to the project goals, since it is the desired proportion of unitary cost reduction that determines the maximum unitary cost (see Section 8.2). Using RRA(λ,μ) allows software managers to restrict the selection of a defect prediction model only among those defect proneness models for which RRA(λ,μ) > 0. This kind of selection cannot be carried out by using AUC and G, which may actually be misleading, as shown in Section 10.
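As a minimal illustration (assuming the RRA values have already been computed elsewhere, and with hypothetical model names), λ can be derived from the project’s unit costs as recalled in Section 13.1, i.e., λ = cFN/(cFN + cFP), and the candidate models can then be filtered on RRA(λ,μ):

```python
# Minimal sketch: derive lambda from the project's unit misclassification costs and
# keep only the defect proneness models with positive RRA(lambda, mu).
# RRA values and model names below are hypothetical placeholders.
def cost_lambda(c_fn: float, c_fp: float) -> float:
    """lambda = cFN / (cFN + cFP)."""
    return c_fn / (c_fn + c_fp)

lam = cost_lambda(c_fn=9.0, c_fp=1.0)    # e.g., a missed faulty module costs 9x a false alarm
mu = 0.5                                 # desired proportion of unitary cost reduction

rra_lambda_mu = {"blr_all_measures": 0.22, "blr_loc_only": 0.0}   # hypothetical values
usable = {name: value for name, value in rra_lambda_mu.items() if value > 0}
print(f"lambda = {lam:.2f}, mu = {mu}; usable models: {sorted(usable)}")
```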

In addition, in Section 8.2, we showed that there is a relationship between performance metrics and the cost reduction proportion μ that can be achieved. Poorly performing models imply low levels of cost reduction and, conversely, high levels of cost reduction imply the need for well-performing models. This may call for building better models, as shown, for instance, in Section 11.3.

Software Defect Prediction researchers can use our proposal to have a more precise assessment of the quality of defect prediction models. Like software managers, they can also customize RRA for their goals. For instance, they can use RRA(ϕ) to have a general assessment of models or focus on specific performance metrics by using RRA(FM,NM), for instance, or any other ones. The variety of ways in which RRA can be defined and used goes beyond what has already been defined and used to delimit the part of the ROC curve to consider in other fields such as medical and biological disciplines (see the review of the literature in Section 13.1).

12 Threats to Validity of the Empirical Study

We here address possible threats to the validity of our empirical study, which we used to demonstrate our proposal and provide further evidence about it.

Some external validity threats are mitigated by the number of real-life datasets and the variety in their characteristics, such as application domains, AP/n ratio, number of modules, and size, as indicated in Section 10.

The values of AUC and G are computed, according to common practice, based on the training set used to build defect prediction models; similarly, to compute RRA, we took the training set as the test set too. Using a different test set than the training set may change the value of \(\frac {AP}{n}\). This concept drift would affect the RoIs to be taken into account (e.g., RoI(Recall, Fall-out)), and, therefore, the value of RRA (e.g., RRA(Recall, Fall-out)). Thus, we may have obtained different results with different test sets than the training sets.

As for the construct validity of RRA, Section 9 shows, based on Sections 5–8, that RRA specifically addresses some of the construct validity issues related to AUC and G, which are widely used in Software Defect Prediction and several other disciplines.

Construct validity may be threatened by the performance metrics used. For instance, FM has been widely used in the literature, but it has also been widely criticized (Shepperd et al. 2014). We also used Precision, Recall, Specificity, NPV, NM, and ϕ, to obtain a more comprehensive picture and a set of constraints based on different perspectives. At any rate, our approach is not limited to any fixed set of performance metrics, and any other metric may be used as well, as long as it is based on confusion matrices.

Also, we used BLR and NB because of the reasons explained in Section 10. Other techniques may be used, but the building of models is not the goal of this paper: we simply needed models for demonstrative purposes.

13 Related Work

Given the importance of defect prediction in Software Engineering, many studies have addressed the definition of defect proneness models. They are too many to mention here. ROC curves have been often used to evaluate defect proneness models, as reported by systematic literature reviews on defect prediction approaches and their performance (Arisholm et al. 2010; Beecham et al. 2010a; Hall et al. 2012).

There has been increasing interest in ROC curves in the Software Defect Prediction and, more generally, Empirical Software Engineering literature in the last few years. For instance, 82 papers using ROC curves appeared in the 2007-2018 period in three major Software Engineering publication venues, namely “IEEE Transactions on Software Engineering,” “Empirical Software Engineering: an International Journal,” and the “International Symposium on Empirical Software Engineering and Measurement,” while no papers using ROC curves appeared in these venues before 2007.

We here first describe and discuss proposals for performance metrics that have appeared in the general literature on ROC curve analysis to address some of the issues related to the adoption of AUC (Section 13.1).

Then, we review a few of the related works published in the Empirical Software Engineering literature, to provide an idea of what kind of work has been done with ROC curves (Section 13.2).

Also, we show in Section 13.3 how cost modeling can be addressed by our approach even with different cost models than the one we use in Section 8.

13.1 ROC Curve Performance Metrics

A few proposals define performance metrics that take into account only portions of a ROC curve. These approaches define various forms of a partial AUC metric (pAUC), which has also been implemented in the R package pROC, available at https://web.expasy.org/pROC/ (Robin et al. 2011).

McClish (1989) computes pAUC as the area under ROC(x) in an interval [x1,x2] between two specified values of x. To compare the performance of digital and analog mammography tests, Baker and Pinsky (2001) compare the partial AUCs for the two different ROC curves in an interval between two small x values. For mammography-related applications too, Jiang et al. (1996) propose a different version of partial AUC, in which they only take into account the high-recall portion of the ROC space, in which \(y > \bar {y}\), where \(\bar {y}\) is a specified value of y. They define a metric as the ratio of the area under ROC(x) and above \(y = \bar {y}\) to the area above \(y = \bar {y}\), i.e., \(1-\bar {y}\). Dodd and Pepe (2003) introduce a nonparametric estimator for partial AUC, computed based on an interval of x, like in McClish (1989), or based on an interval of y. All four papers carry out further statistical investigations (e.g., the definition of statistical tests for comparing the areas under different ROC curves), based on statistical assumptions.

McClish also defines a “standardized” version of pAUC that takes into account only the part of the vertical slice in the [x1,x2] interval that is also above the diagonal, i.e., the trapezoid delimited by the diagonal y = x, the vertical lines x = x1 and x = x2, and the horizontal line y = 1. Specifically, the standardized metric is based on the ratio (which we call here pG) between the area under the curve and above the diagonal, on one hand, and the area of the trapezoid, on the other hand. The standardized metric is then defined as \(\frac{1}{2}(1+pG)\). Note that pG coincides with the partial Gini index defined along the same lines by Pundir and Seshadri (2012), who compute the normalized value of Gini’s G in an interval [x1,x2]; it was used by Lessmann et al. (2015) with x in the interval [0,0.4]. As for AUC and G, there is a relationship between pAUC and pG: it can be easily shown analytically that \(pAUC = pG\left(1-\bar{x}\right)+\bar{x}\), where \(\bar{x} = \frac{x_{1}+x_{2}}{2}\) is the midpoint of the [x1,x2] interval.
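The relationship can be checked numerically, for instance with the following sketch, under the assumption that pAUC denotes the partial area under ROC(x) over [x1,x2] normalized by the slice width (so that it coincides with AUC on the whole [0,1] interval) and that pG is the standardized ratio described above.

```python
# Numerical check of pAUC = pG*(1 - xbar) + xbar on a synthetic concave ROC curve,
# under the normalization assumptions stated in the text above.
import numpy as np

def trap(vals: np.ndarray, xs: np.ndarray) -> float:
    """Composite trapezoidal rule."""
    return float(np.sum((vals[1:] + vals[:-1]) / 2.0 * np.diff(xs)))

roc = lambda x: x ** 0.25                 # a synthetic ROC curve above the diagonal
x1, x2 = 0.1, 0.4
x = np.linspace(x1, x2, 10_001)
y = roc(x)

xbar = (x1 + x2) / 2
p_auc = trap(y, x) / (x2 - x1)                       # slice-normalized partial AUC
p_g = trap(y - x, x) / ((x2 - x1) * (1 - xbar))      # standardized partial Gini (pG)
print(np.isclose(p_auc, p_g * (1 - xbar) + xbar))    # True
```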

These approaches and ours share the idea that evaluating ROC(x) can be of interest with respect to portions of the curve. However, the portion of the ROC space they take into account is either a vertical slice, a trapezoid, or a horizontal slice, and cannot take different forms, such as the ones possible with our approach, e.g., those depicted in Figs. 3a and 5. Also, these portions of the ROC space are not necessarily RoIs according to Definition 3. Thus, there are some RoIs that are not taken into account by these approaches, and vice versa. The reason lies in the different goals of these proposals and ours. Our goal is to identify the classifiers whose performance is better than some reference values. The other approaches aim to limit the set of interesting values of either Recall or Fall-out. So, they may take into account classifiers whose performance is worse than reference values, e.g., those obtained via random classifiers.

The meaning of AUC has also been investigated by Hand (2009), who finds that computing AUC is equivalent to computing an average minimum misclassification cost with variable weights. Specifically, Hand finds that the expected minimum misclassification cost is equal to \(2 \frac{AP \cdot AN}{n^{2}}(1-AUC)\) when the values of cFN and cFP are not constant, but depend on the classifier used. This poses a conceptual problem, since cFN and cFP should instead depend on the software process costs and on the costs related to delivering software modules with defects, not on the classifier used. Our proposal partially alleviates Hand’s issue by delimiting and reducing the set of classifiers taken into account, so that the variability of the values of cFN and cFP is reduced. Hand also defines an alternative metric, H, which, however, relies on two assumptions: (1) cFN + cFP and \(\lambda = \frac{c_{FN}}{c_{FN}+c_{FP}}\) are statistically independent; (2) the probability density function of λ used for the computation of \(\mathbb{E}[TC]\) is a Beta distribution with specified parameters. As for the second assumption, Hand discards the use of a uniform distribution because it would treat extreme and more central values of λ as equally likely. Hand also advocates the use of different functions that are more closely related to practitioners’ and researchers’ goals, if available, and shows how H can be estimated. Given the correspondence between straight lines and cost models, the identification of a RoI with our proposal may help delimit the set of values of λ to take into account.

Other performance metrics for ROC curves are described in Swets and Pickett (1982). These metrics are all based on a binormal ROC space, in which the abscissa and the ordinate represent the normal deviates obtained from x and y. One of the recommended metrics is Az, which is the area under the curve of a transformed ROC curve in the binormal ROC space. Since a border can be transformed into a line in the binormal ROC space, our RoIs can help take into account only the portion of the transformed ROC curve that is relevant, and compute only the area under that portion. Other metrics are \(d^{\prime }\), \(d^{\prime }_{e}\), and Δm, which are generalized by da, which represents the distance of a transformed ROC curve from the origin of the binormal ROC space. Also, Swets and Pickett (1982) mention metric β, which can be computed based on costs and benefits of positive and negative observations, which can only be subjectively assessed.

Several papers (Flach 2003; Provost and Fawcett 2001; Vilalta and Oblinger 2000) define “iso-performance lines” or “isometrics,” i.e., lines in the ROC space composed of classifiers with the same value of some specified performance metric. Our proposal uses those lines as borders for RoIs and shows how to derive them, starting from random policies, to delimit RoIs.

13.2 ROC Curves and AUC in Empirical Software Engineering

ROC curves have been used in Empirical Software Engineering for the assessment of models for several external software attributes (Fenton and Bieman; Morasca 2009).

Here are just a few recent examples of the variety of ways in which ROC curves have been used to assess defect prediction models: Di Nucci et al. (Nucci et al. 2018) use ROC curves and AUC for models based on information about human-related factors; McIntosh and Kamei (2018) for change-level defect prediction models; Nam et al. (2018) for heterogeneous defect prediction; Herbold et al. (2018) to assess the performance of cross-project defect prediction approaches.

As for other external software attributes: Kabinna et al. (2018) use ROC curves and AUC to assess the change proneness of logging statements; da Costa et al. (2018) to study in which future release a fixed issue will be integrated in a software product; Murgia et al. (2018) to assess models for identifying emotions like love, joy, and sadness in issue report comments; Ragkhitwetsagul et al. (2018) to evaluate code similarity; Arisholm et al. (2007) to assess the performance of predictive models obtained via different techniques to identify parts of a Java system with a high probability of fault; Dallal and Morasca (2014) to evaluate module reusability models; Posnett et al. (2011) to study the risk of having fallacious results by conducting studies at the wrong aggregation level; Cerpa et al. (2010) to evaluate models of the relationships linking variables and factors to project outcomes; Malhotra and Khanna (2013) to assess change proneness models.

ROC curves (with and without AUC) have also been used in Empirical Software Engineering studies to find optimal thresholds t for fn\((\underline{z}, t)\) to build a defect prediction model. For instance, Shatnawi et al. (2010) use AUC to quantify the strength of the relationship between a variable z and defect proneness. The threshold selected by Tosun and Bener (2009) corresponds to the ROC curve point at minimum Euclidean distance from the ideal point (0,1), which represents perfect estimation. The threshold selected by Sánchez-González et al. (2012) corresponds to the farthest point from the ROC diagram diagonal (see also Mendling et al. (2012)).

13.3 Cost Modeling

The cost related to the use of defect prediction models has been the subject of several studies in the literature that focused on misclassification costs (Hand 2009; Jiang and Cukic 2009; Khoshgoftaar and Allen 1998; Khoshgoftaar et al. 2001; Khoshgoftaar and Seliya 2004) or used cost curves (Drummond and Holte 2006; Jiang et al. 2008). A recent paper by Herbold defines a cost model based on the idea that a defect may affect several modules and a module may be affected by several defects. Herbold’s cost model also allows the use of different Verification and Validation costs for different modules, different costs for different defects, and different probabilities that Verification and Validation activities miss a defect.

Here we provide a more detailed discussion of the cost model investigated by Zhang and Cheung (2013). Specifically, Zhang and Cheung use the overall cost of a prediction model Cp = cFP(TP + FP) + cFN FN, which includes the term cFP TP, and derive two inequalities that must be satisfied by the confusion matrix of a defect prediction model. We now show how this cost model can be studied in our approach, by using the inequalities proposed by Zhang and Cheung to define RoI borders.

The first inequality is derived by comparing the value of Cp obtained with a binary classifier and the value obtained by trivially estimating all modules positive. When \(\lambda \neq \frac {1}{2}\), the equation of the first border is

$$ y - 1 = \frac{(1-\lambda)k}{2\lambda-1}\,(x-1) \qquad \text{(11)} $$

This is a pencil of straight lines going through the point (1,1). A straight line from this pencil defines an effective border (i.e., it lies in the upper-left triangle) if and only if its slope is between 0 and 1, i.e., \(0 < \frac{(1-\lambda)k}{2\lambda-1} < 1\). When \(\lambda > \frac{1}{2}\), the slope is nonnegative, and it can be shown that it is less than 1 if and only if \(\frac{c_{FP}}{c_{FN}} < \frac{AP}{n}\). When \(\lambda < \frac{1}{2}\), the slope is negative, and the straight lines lie outside the ROC space. When \(\lambda = \frac{1}{2}\), the equation of the border is x = 1, which is not an effective border in the ROC space.
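As a small illustration of this case analysis (taking λ and k simply as given inputs, with k the constant appearing in (11), defined earlier in the paper), one can check whether a line of the pencil is an effective border:

```python
# Sketch of the effectiveness check for border (11): a line of the pencil through (1, 1)
# is an effective border only if its slope lies strictly between 0 and 1.
# lambda and k are taken as given; this only illustrates the case analysis above.
from typing import Optional

def border_slope(lam: float, k: float) -> Optional[float]:
    if lam == 0.5:
        return None                        # border degenerates to x = 1: not effective
    return (1 - lam) * k / (2 * lam - 1)

def is_effective(lam: float, k: float) -> bool:
    s = border_slope(lam, k)
    return s is not None and 0 < s < 1

print(is_effective(lam=0.9, k=2.0))        # slope = 0.2 / 0.8 = 0.25 -> effective
print(is_effective(lam=0.4, k=2.0))        # negative slope -> outside the ROC space
```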

The second inequality is derived by comparing the value of Cp obtained with a defect prediction model and the value obtained with the uni random policy. The second inequality, however, turns out to always set the diagonal y = x as the border straight line, so it does not introduce any real constraints.

At any rate, any cost model based on the cells of the confusion matrix related to a defect prediction model can be dealt with by our approach.

14 Conclusions and Future Work

In this paper, new concepts and techniques are proposed to assess the performance of a given defect proneness model fp\((\underline {z})\) when building defect prediction models. The proposed assessment is based on two fundamental concepts: 1) only models that outperform reference ones—including, but not limited to, random estimation—should be considered, and 2) any combination of performance metrics can be used. Concerning the latter point, not only do we allow researchers and practitioners to use the metric they like best (e.g., F-measure or ϕ), but we introduce the possibility of evaluating models against cost, and we show that there is a clear correspondence between performance metrics (such as Recall, Precision, F-measure, etc.) and cost.

Using the proposed technique, practitioners and researchers can identify the thresholds worth using to derive defect prediction models based on a given defect proneness model, so that the obtained models perform better than reference ones. Our approach helps practitioners evaluate competing defect prediction models and select adequate ones for their goals and needs. It allows researchers to assess software development techniques based only on those defect prediction models that should be used in practice, and not on irrelevant ones that may bias the results and lead to misleading conclusions.

Unlike the traditional AUC metric, which considers the entire ROC curve, our approach considers only the part of the ROC curve where performance—evaluated via the metrics of choice—is better than reference performance values, which can be provided by reference models, for instance.

We show that RRA—when used with suitable areas of interest, like those that exclude random behaviors—is theoretically sounder than traditional ROC-based metrics (like AUC and Gini’s G). The latter are special cases of RRA, but computed on areas that include worse-than-random classifiers.

We also applied RRA, G, and AUC to models obtained from 67 real-life projects, and compared the obtained indications. RRA appeared to provide much deeper insight into the actual performance of models.

RRA proved more adequate than AUC and G in capturing the information used in the evaluation of defect prediction models. Specifically, AUC and G appeared to consider a large amount of information pertaining to random (or worse) performance conditions. As a consequence, AUC and G often reported high performance levels, while the performance of the corresponding models was actually much lower. In these cases, RRA provided much more realistic indications, revealing these low performance levels. Our analysis also showed that AUC and G can quite frequently be misleading.

Although in the empirical validation (Section 10) only measures taken on modules were used as independent variables, other types of measures—e.g., process measures—could be used in exactly the same way. That is, our approach is applicable to a broader class of models than those considered in this paper.

As a further generalization, our approach can be used outside Software Defect Prediction. What is needed is a scoring function, so that a ROC curve can be built. Thus, if such a scoring model is available for, say, availability, it can be used in our approach in exactly the same way as defect proneness models.

Even more generally, the approach can be conceptually applied to any kind of scoring function, so it can be used in disciplines beyond Empirical Software Engineering. Also, the approach can be used with other kinds of constraints set on the scoring function. For instance, one can also set a constraint on the value of the first derivative of the scoring function, as we did in our previous work (Morasca and Lavazza 2017) to define risk-averse thresholds for defect proneness models.

Future work will be needed to provide more evidence about the usefulness and the limitations of the approach, including

  • the assessment of the approach on more datasets

  • the use of additional performance metrics, to have a more complete idea of the performance of the classification

  • a more in-depth study of the characteristics of RRA, for instance, by introducing statistical tests to check whether the differences in the values of RRA of different ROC curves are statistically significant

  • the investigation of other techniques for obtaining an overall assessment of a defect proneness model

  • the investigation of other cost models, such as the one recently introduced by Herbold and discussed in Section 13.3

  • the application to other external attributes that can be quantified by means of probabilistic models (Krantz et al. 1971; Morasca 2009), e.g., maintainability, usability, reusability.